Abstract

In clinical practice, physicians make a series of treatment decisions over the course of a patient's 
disease based on his/her baseline and evolving characteristics. A dynamic treatment regime is a set of 
sequential decision rules that operationalizes this process. Each rule corresponds to a key decision point 
and dictates the next treatment action among the options available as a function of accrued information 
on the patient. Using data from a clinical trial or observational study, a key goal is estimating the
optimal regime, that is, the regime that, if followed by the patient population, would yield the most
favorable outcome on average. Q-learning and advantage (A-)learning are two main approaches for this
purpose. We provide a detailed account of Q- and A-learning and study systematically the performance of
these methods.
The methods are illustrated using data from a study of depression. 

1 Introduction 

In the health sciences, an area of considerable current interest is personalized medicine, which involves 
making treatment decisions for an individual patient using all information available on the patient, including 
genetic, physiologic, demographic, and other clinical variables, to achieve the "best" outcome for the patient 
given this information. In treating a patient with an ongoing disease or disorder, a clinician makes a series 
of decisions based on the patient's evolving status, so seeking to tailor treatment to the patient. A dynamic 
treatment regime is a list of sequential decision rules that formalizes this process. Each rule corresponds to 
a key decision point in the disease or disorder progression and takes as input the information on the patient 
to that point and outputs the treatment that s/he should receive from among the available options. A key 
step toward personalized medicine is thus finding the optimal dynamic treatment regime, that which, if 
followed by the entire patient population, would yield the most favorable outcome on average.

The statistical problem is to estimate the optimal regime based on data from a clinical trial or
observational study. Q-learning (Q denoting "quality," Watkins, 1989; Watkins and Dayan, 1992; Nahum-Shani
et al., 2010) and advantage learning (A-learning, Murphy, 2003; Blatt, Murphy, and Zhu, 2004) are two main
approaches proposed for this purpose. Both follow from developments on reinforcement learning methods 
for sequential decision-making in the computer science literature. As described shortly, Q-learning is based 
roughly on posited regression models for the outcome of interest given patient information at each decision 
point and is implemented through a backwards (in time) recursive fitting procedure that is related to 
the dynamic programming algorithm (Bather, 2000), a standard approach for deducing optimal sequential 
decisions. A-learning involves the same recursive strategy, but, instead of requiring full regression
relationships to be posited, requires only models for the part of the outcome regression involved in
representing contrasts among treatments along with models for the probability of observed treatment
assignment given patient information at each decision point. As discussed in the sequel, this feature may
make A-learning more robust to model misspecification than Q-learning for consistent estimation of the
optimal treatment regime.

Examples of the use of Q- and A-learning and related methods to deduce optimal strategies for treatment
of substance abuse, psychiatric disorders, cancer, and HIV infection and for dose adjustment in response
to evolving patient status are given by, for example, Rosthøj et al. (2006); Murphy et al. (2007a, b);
Zhao, Kosorok, and Zeng (2009); and Henderson, Ansell, and Alshibani (2010). Related work includes Thall,
Millikan, and Sung (2000), Thall, Sung, and Estey (2002), Robins (2004), Moodie, Richardson, and Stephens
(2007), Thall et al. (2007), Robins, Orellana, and Rotnitzky (2008), Almirall, Ten Have, and Murphy (2010)
and Orellana, Rotnitzky, and Robins (2010).

Despite increasing interest in estimation of optimal dynamic treatment regimes, there has been little
study of the relative merits of Q- and A-learning, nor of consequences of misspecification of the postulated
models involved. Moreover, although descriptions of Q- and A-learning are available, a self-contained
account of both has not been presented. In this article, we provide a detailed description of an appropriate
statistical framework in which an optimal regime may be defined formally and introduce Q- and A-learning
in this context. Conditions under which these methods may be expected to yield credible estimators for
optimal regimes based on observed data are discussed, and we report on a systematic study of the methods'
performance.

Section 2 introduces the statistical framework, and Section 3 makes precise the form of an optimal
regime. We describe and contrast Q- and A-learning in Section 4 and present extensive simulations
evaluating their performance, including under model misspecification, in Section 5. The methods are
demonstrated using data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D, Rush et
al., 2004) study in Section 6.

2 Framework and Assumptions 

We consider the general setting of K prespecified, ordered decision points, indexed by k = 1, . . . , K , which 
may be times or events in the disease or disorder process that necessitate a treatment decision, where, at 
each point, a set of treatment options is available. Assume that there is a final outcome Y of interest for 
which, without loss of generality, large values are preferred. The outcome may be ascertained following 
the Kth decision, as in the case of CD4 T-cell count at a prespecified follow-up time in a study of HIV 
infection (Moodie et al., 2007); or may be a function of information accrued over the entire sequence of 
decisions, as in Henderson et al. (2010), where outcome is the overall proportion of time a measure of blood 
clotting speed is kept within a target range in a study of dosing of anticoagulant agents. 

In order to define an optimal treatment regime and discuss estimation of an optimal regime based on 
data from an observational study or clinical trial, we first define a suitable conceptual framework. For 
simplicity, our presentation is heuristic. We imagine that there is a superpopulation of patients, denoted
by $\Omega$, where one may view an element $\omega \in \Omega$ as a patient from this population. We assume
that patients in the population have been treated and otherwise have behaved according to routine clinical
practice for the disease or disorder prior to the first treatment decision. Consequently, immediately prior
to this first decision, patient $\omega$ would present to the decision-maker with a set of baseline
information (covariates) denoted by the random variable $S_1$; we discuss this further below. Thus,
$S_1(\omega)$ is the value of his/her information immediately prior to decision 1 under these conditions,
taking values $s_1$, say, in a set $\mathcal{S}_1$. Assume that, at each decision point $k = 1, \ldots, K$,
there is a set of possible treatment options $\mathcal{A}_k$, where we denote elements of $\mathcal{A}_k$
by $a_k$. We write $\bar{a}_k = (a_1, \ldots, a_k)$ to denote a possible treatment history that could be
administered through the $k$th decision, taking values in the corresponding set
$\bar{\mathcal{A}}_k = \mathcal{A}_1 \times \cdots \times \mathcal{A}_k$. Thus, $\bar{\mathcal{A}}_K$ denotes
the set of all possible full treatment histories $\bar{a}_K$ through all $K$ decisions.
We then define the potential outcomes (Robins, 1986)

$$W = \{S_2^*(a_1), S_3^*(\bar{a}_2), \ldots, S_k^*(\bar{a}_{k-1}), \ldots, S_K^*(\bar{a}_{K-1}), Y^*(\bar{a}_K) \text{ for all } \bar{a}_K \in \bar{\mathcal{A}}_K\}. \tag{1}$$

In (1), $S_k^*(\bar{a}_{k-1})(\omega)$ denotes the value of covariate information that would arise between
the $(k-1)$th and $k$th decision points for a patient $\omega \in \Omega$ under the hypothetical situation
that s/he were to have received previously treatment history $\bar{a}_{k-1}$, taking values $s_k$ in a set
$\mathcal{S}_k$, $k = 2, \ldots, K$. Similarly, $Y^*(\bar{a}_K)(\omega)$ is the hypothetical outcome that
would result for patient $\omega$ were s/he to have been administered the full set of $K$ treatments in
$\bar{a}_K$. Here and henceforth, this notation implies that, for random variables such as
$S_k^*(\bar{a}_{k-1})$, $\bar{a}_{k-1}$ is an index representing prior treatment history. For convenience,
write $\bar{S}_k^*(\bar{a}_{k-1}) = \{S_1, S_2^*(a_1), \ldots, S_k^*(\bar{a}_{k-1})\}$, $k = 1, \ldots, K$,
where $\bar{S}_k^*(\bar{a}_{k-1})(\omega)$ takes values $\bar{s}_k$ in
$\bar{\mathcal{S}}_k = \mathcal{S}_1 \times \cdots \times \mathcal{S}_k$; note that this definition includes
the baseline covariate $S_1$ and is taken equal to $S_1$ when $k = 1$. In what follows, for simplicity, we
take all random variables to be discrete, but the results we present hold more generally.

Let the random variables $A_1^{(P)}, \ldots, A_K^{(P)}$ denote the treatments that would be assigned to
patients in the population at decisions $1, \ldots, K$ under routine clinical practice, so that
$A_k^{(P)}(\omega)$ is the treatment in $\mathcal{A}_k$ that patient $\omega$ would receive at decision $k$,
taking values $a_k \in \mathcal{A}_k$. By routine clinical practice, we mean the conditions under which
patients in the population and their providers would make treatment decisions acting as they see fit,
emphasized by the superscript $(P)$ (for "population"), to be distinguished from those of a clinical trial,
discussed later. Thus, the $A_k^{(P)}$ characterize the mechanism by which treatments are assigned in the
population if patients and clinicians are left to their own devices. Likewise, define the random variables
$S_k^{(P)}$, $k = 2, \ldots, K$, to be the covariate information that would be observed on patients in the
population between decisions $k-1$ and $k$ under the treatment assignments $A_k^{(P)}$, taking values
$s_k \in \mathcal{S}_k$; let $Y^{(P)}$ be the corresponding observed outcome, taking values $y$ in a set
$\mathcal{Y}$; and define $\bar{A}_k^{(P)} = (A_1^{(P)}, \ldots, A_k^{(P)})$, taking values
$\bar{a}_k \in \bar{\mathcal{A}}_k$. Henceforth, as is standard, we make the consistency assumption (e.g.,
Robins, 1994) that the covariates and outcomes that would be observed under these conditions are those
that potentially would be seen under the treatments actually received; that is, for patient
$\omega \in \Omega$, $S_k^{(P)}(\omega) = S_k^*\{\bar{A}_{k-1}^{(P)}(\omega)\}(\omega)$, $k = 2, \ldots, K$,
and $Y^{(P)}(\omega) = Y^*\{\bar{A}_K^{(P)}(\omega)\}(\omega)$. We also make the stable unit treatment value
assumption (Rubin, 1978), which ensures that a patient's covariates and outcome are unaffected by how
treatments are allocated to her/him and other patients.

Under this conceptualization, probabilities for events in $\Omega$ are induced by random sampling from this
population, as are all probability distributions of the potential data above and observed data that would
be obtained from studies carried out in the population. The goal of Q- and A-learning is to estimate
the optimal treatment regime based on data from an observational study or clinical trial carried out in a
random sample from this population.

A dynamic treatment regime $d = (d_1, \ldots, d_K)$ is a set of rules that dictates an algorithm for treating
a patient over time. At the $k$th decision point, the $k$th rule $d_k(\bar{s}_k, \bar{a}_{k-1})$, say, takes
as input the patient's realized covariate and treatment history prior to the $k$th treatment decision and
outputs a value $a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1}) \subseteq \mathcal{A}_k$; for $k = 1$, there is no
prior treatment ($\bar{a}_0$ is null), and we write $d_1(s_1)$ and $\Psi_1(s_1)$. Here,
$\Psi_k(\bar{s}_k, \bar{a}_{k-1})$ is the set of feasible treatment options for a patient with realized
history $(\bar{s}_k, \bar{a}_{k-1})$, reflecting that some treatment options may be unethical or impossible
for patients with certain histories. We discuss considerations for identifying the
$\Psi_k(\bar{s}_k, \bar{a}_{k-1})$ shortly. Because we consider only regimes where
$d_k(\bar{s}_k, \bar{a}_{k-1}) \in \Psi_k(\bar{s}_k, \bar{a}_{k-1}) \subseteq \mathcal{A}_k$, $d_k$ need only
map a subset of $\bar{\mathcal{S}}_k \times \bar{\mathcal{A}}_{k-1}$ to $\mathcal{A}_k$. We define these
subsets recursively as

$$\Gamma_k = \big\{(\bar{s}_k, \bar{a}_{k-1}) \in \bar{\mathcal{S}}_k \times \bar{\mathcal{A}}_{k-1} \text{ satisfying (i) } a_j \in \Psi_j(\bar{s}_j, \bar{a}_{j-1}),\ j = 1, \ldots, k-1, \text{ and (ii) } \mathrm{pr}\{\bar{S}_k^*(\bar{a}_{k-1}) = \bar{s}_k\} > 0\big\} \tag{2}$$

for $k = 1, \ldots, K$. Thus, we may define formally the class of all feasible treatment regimes
$\mathcal{D}$, say, as the set of all $d = (d_1, \ldots, d_K)$ for which $d_k$, $k = 1, \ldots, K$, is a
mapping from $\Gamma_k$ into $\mathcal{A}_k$ satisfying
$d_k(\bar{s}_k, \bar{a}_{k-1}) \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})$ for every
$(\bar{s}_k, \bar{a}_{k-1}) \in \Gamma_k$.

Intuitively, an optimal regime should represent the "best" way to intervene to treat patients in $\Omega$
who would otherwise behave according to routine clinical practice. We now state with specificity what we
mean by this. To this end, for any $d \in \mathcal{D}$, writing $\bar{d}_k = (d_1, \ldots, d_k)$,
$k = 1, \ldots, K$, $\bar{d}_K = d$, define the potential outcomes
$\{S_2^*(d_1), \ldots, S_k^*(\bar{d}_{k-1}), \ldots, S_K^*(\bar{d}_{K-1}), Y^*(d)\}$ associated with a regime
$d \in \mathcal{D}$ such that, for any $\omega \in \Omega$, with $S_1(\omega) = s_1$,

$$d_1(s_1) = u_1,\ S_2^*(d_1)(\omega) = S_2^*(u_1)(\omega) = s_2,\ d_2(\bar{s}_2, u_1) = u_2,\ \ldots,\ d_{K-1}(\bar{s}_{K-1}, \bar{u}_{K-2}) = u_{K-1},$$
$$S_K^*(\bar{d}_{K-1})(\omega) = S_K^*(\bar{u}_{K-1})(\omega) = s_K,\ d_K(\bar{s}_K, \bar{u}_{K-1}) = u_K,\ Y^*(d)(\omega) = Y^*(\bar{u}_K)(\omega) = y. \tag{3}$$

The index $\bar{d}_{k-1}$ emphasizes that $S_k^*(\bar{d}_{k-1})(\omega)$ represents the covariate information
that would arise between decisions $k-1$ and $k$ were patient $\omega$ to receive the treatments sequentially
dictated by the first $k-1$ rules in $d$. Similarly, $Y^*(d)(\omega)$ is the final outcome that $\omega$
would experience if s/he were to receive the $K$ treatments dictated by $d$.

With these definitions, the expected outcome in the population if all patients with initial state $S_1 = s_1$
were to follow regime $d$ is $E\{Y^*(d) \mid S_1 = s_1\}$. An optimal regime, $d^{\mathrm{opt}} \in \mathcal{D}$,
say, satisfies

$$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(d^{\mathrm{opt}}) \mid S_1 = s_1\} \text{ for all } d \in \mathcal{D} \text{ and all } s_1 \in \mathcal{S}_1. \tag{4}$$

In Section 3, we give the form of $d^{\mathrm{opt}}$ satisfying (4) and demonstrate further optimality
properties.

Of course, potential outcomes for a given patient for all $d \in \mathcal{D}$ are not observed. Thus, the
goal is to estimate $d^{\mathrm{opt}}$ in (4) using data from a study carried out on a random sample of $n$
patients from $\Omega$ that records baseline and evolving covariate information and the treatments actually
received by the participants. We denote the available study data as independent and identically distributed
(i.i.d.) time-ordered random variables $(S_{1i}, A_{1i}, \ldots, S_{Ki}, A_{Ki}, Y_i)$, $i = 1, \ldots, n$, on
$\Omega$. Here, $S_1$ is as before; $S_k$, $k = 2, \ldots, K$, is the covariate information recorded between
decisions $k-1$ and $k$, taking values $s_k \in \mathcal{S}_k$; $A_k$, $k = 1, \ldots, K$, is the recorded,
observed treatment assignment, taking values $a_k \in \mathcal{A}_k$; and $Y$ is the observed outcome,
taking values $y \in \mathcal{Y}$. As above, we define $\bar{S}_k = (S_1, \ldots, S_k)$ and
$\bar{A}_k = (A_1, \ldots, A_k)$, $k = 1, \ldots, K$, taking values $\bar{s}_k \in \bar{\mathcal{S}}_k$ and
$\bar{a}_k \in \bar{\mathcal{A}}_k$.

It is important to recognize that the nature of the study generating the available data must be considered
carefully. If the data arise from an observational study in which covariate, treatment, and outcome
information on $n$ participants randomly sampled from $\Omega$ is recorded, with no intervention by
investigators, then it is reasonable to assume that the mechanism by which treatments are assigned to the
patients in the sample during the study is the same as that for the entire population under routine
practice. In this case, for $k = 1, \ldots, K$, $A_k = A_k^{(P)}$, so that, under the consistency assumption,
for $k = 2, \ldots, K$, $S_k(\omega) = S_k^{(P)}(\omega) = S_k^*\{\bar{A}_{k-1}^{(P)}(\omega)\}(\omega)$, and
$Y(\omega) = Y^{(P)}(\omega) = Y^*\{\bar{A}_K^{(P)}(\omega)\}(\omega)$. Here, the form of
$\Psi_k(\bar{s}_k, \bar{a}_{k-1})$, $k = 1, \ldots, K$, is determined by treatment choices dictated by
clinical practice.

Such a correspondence between the $S_k$, $A_k$ and $S_k^{(P)}$, $A_k^{(P)}$ is not the case for an
intervention study. A clinical trial design that has been advocated for collecting data suitable for
estimating optimal treatment regimes is that of a so-called sequential multiple-assignment randomized trial
(SMART, Lavori and Dawson, 2000; Murphy, 2005). In a SMART involving $K$ pre-specified decision points, each
participant is randomized at each decision point to one of a set of feasible treatment options, where, at
the $k$th decision, the randomization probabilities may depend on past realized information
$\bar{s}_k, \bar{a}_{k-1}$. As we discuss further shortly, as with any clinical trial, an advantage is that
the usual issues of confounding associated with an observational study are obviated. However, the treatment
assignment mechanism in the study is no longer the same as that in the population under routine practice.
More precisely, the sample space is now $\Omega \times \bar{\mathcal{A}}_K$, where for any element
$(\omega \times \bar{a}_K)$, $\omega$ represents the patient randomly sampled from the population, and
$\bar{a}_K$ represents the treatments assigned to her/him at all $K$ decisions by the random mechanism
dictated by the trial design. Here, then, the observed $\bar{A}_K(\omega \times \bar{a}_K) = \bar{a}_K$ and
$S_k(\omega \times \bar{a}_K) = S_k^*(\bar{a}_{k-1})(\omega)$. Thus, in contrast to an observational study,
$A_k \ne A_k^{(P)}$ and $S_k \ne S_k^{(P)}$ in general. Moreover, the treatment options in
$\Psi_k(\bar{s}_k, \bar{a}_{k-1})$ are dictated by the trial design and so may be different from those in
routine practice. In particular, the set of treatment options at each decision might be restricted relative
to those available in clinical practice for reasons of logistics, cost, or interest of the trial sponsor in
only certain products. We discuss further considerations for using data from a SMART to estimate optimal
regimes in Section A.2 of the Appendix.
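
As a rough illustration of the design just described, the following Python sketch simulates a toy two-stage SMART (K = 2) with binary treatments. Everything numeric here is an assumption made for illustration only: the function name `simulate_smart`, the binary covariates, the randomization probabilities (0.5 at decision 1; 0.5 or 0.25 at decision 2, depending on the realized intermediate covariate), and the outcome model are hypothetical and are not taken from any trial discussed in this article.

```python
import random

def simulate_smart(n, seed=0):
    """Simulate n records (s1, a1, s2, a2, y) from a toy two-stage SMART."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        s1 = rng.randint(0, 1)                     # baseline covariate S1
        a1 = int(rng.random() < 0.5)               # decision 1: pr(A1 = 1) = 0.5
        # Intermediate covariate S2 depends on the baseline value and on a1.
        s2 = int(rng.random() < 0.3 + 0.2 * a1 + 0.2 * s1)
        # Decision 2: randomization probability depends on realized history.
        p2 = 0.25 if s2 == 1 else 0.5
        a2 = int(rng.random() < p2)
        # Hypothetical outcome model; larger values are preferred.
        y = s1 + 0.5 * a1 + a2 * (1.0 - 2.0 * s2) + rng.gauss(0.0, 1.0)
        records.append((s1, a1, s2, a2, y))
    return records

data = simulate_smart(2000, seed=1)
```

Because the assignment probabilities are known functions of the recorded history, treatment at each decision is independent of the potential outcomes given that history by construction.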

In order to use the observed data from either type of study to estimate an optimal regime, the critical
assumption of no unmeasured confounders, also referred to as the sequential randomization assumption
(Robins, 1994), must be satisfied. A version of this assumption states that $A_k$ is conditionally
independent of $W$ given $\{\bar{S}_k, \bar{A}_{k-1}\}$, $k = 1, \ldots, K$, where $\bar{A}_0$ is null,
written $A_k \perp\!\!\!\perp W \mid \bar{S}_k, \bar{A}_{k-1}$. In a SMART, this assumption is automatically
satisfied by design. In an observational study, this assumption is unverifiable from the observed data.
Although in the population patients and their providers may make treatment decisions based on past
covariate information available to them, the issue is whether or not all of this information is recorded in
the $\bar{S}_k$; see Section A.2 of the Appendix.

3 Defining the Optimal Treatment Regime 

Q- and A-learning are two approaches to estimating $d^{\mathrm{opt}}$ satisfying (4) under the foregoing
framework and assumptions. Both involve similar recursive fitting algorithms; the main distinguishing
feature is the form of the respective underlying models. To appreciate the rationale for the methods, one
must first understand how $d^{\mathrm{opt}}$ is determined via dynamic programming, also referred to as
backward induction. We demonstrate the formulation of $d^{\mathrm{opt}}$ in terms of the potential outcomes
and then show how $d^{\mathrm{opt}}$ may be expressed in terms of the observed data under assumptions
including those of the last section. In the following, we sometimes highlight dependence on specific
elements of quantities such as $\bar{a}_k$, writing, for example, $\bar{a}_k$ as $(\bar{a}_{k-1}, a_k)$.

At the $K$th decision point, for any $\bar{s}_K \in \bar{\mathcal{S}}_K$,
$\bar{a}_{K-1} \in \bar{\mathcal{A}}_{K-1}$ for which $(\bar{s}_K, \bar{a}_{K-1}) \in \Gamma_K$, define

$$d_K^{(1)\mathrm{opt}}(\bar{s}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{S}_K^*(\bar{a}_{K-1}) = \bar{s}_K\}, \tag{5}$$

$$V_K^{(1)}(\bar{s}_K, \bar{a}_{K-1}) = \max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \bar{S}_K^*(\bar{a}_{K-1}) = \bar{s}_K\}. \tag{6}$$

For $k = K-1, \ldots, 1$ and any $\bar{s}_k \in \bar{\mathcal{S}}_k$,
$\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}$ for which $(\bar{s}_k, \bar{a}_{k-1}) \in \Gamma_k$, let

$$d_k^{(1)\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} E[V_{k+1}^{(1)}\{\bar{s}_k, S_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{S}_k^*(\bar{a}_{k-1}) = \bar{s}_k], \tag{7}$$

$$V_k^{(1)}(\bar{s}_k, \bar{a}_{k-1}) = \max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} E[V_{k+1}^{(1)}\{\bar{s}_k, S_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \bar{S}_k^*(\bar{a}_{k-1}) = \bar{s}_k], \tag{8}$$

so that, for $s_1 \in \mathcal{S}_1$,
$d_1^{(1)\mathrm{opt}}(s_1) = \arg\max_{a_1 \in \Psi_1(s_1)} E[V_2^{(1)}\{s_1, S_2^*(a_1), a_1\} \mid S_1 = s_1]$, and
$V_1^{(1)}(s_1) = \max_{a_1 \in \Psi_1(s_1)} E[V_2^{(1)}\{s_1, S_2^*(a_1), a_1\} \mid S_1 = s_1]$. Note that the
above conditional expectations are well-defined by condition (ii) in (2) defining $\Gamma_k$.

It is clear that $d^{(1)\mathrm{opt}} = (d_1^{(1)\mathrm{opt}}, \ldots, d_K^{(1)\mathrm{opt}})$ defined above
is a treatment regime, as it comprises a set of rules that uses patient information prior to each decision
to assign treatment from among the feasible options. The superscript (1) indicates that
$d^{(1)\mathrm{opt}}$ provides a set of $K$ rules for a patient presenting prior to decision point 1 with
baseline information $S_1 = s_1$. Note that $d^{(1)\mathrm{opt}}$ is defined in a backward iterative
fashion. At the $K$th decision, (5) gives the treatment among the feasible options at decision $K$ that
maximizes the expected potential final outcome given the prior potential information available, and (6)
is the maximum value achieved. At decisions $k = K-1, \ldots, 1$, intuitively, (7) gives the treatment that
maximizes the expected outcome that would be achieved if subsequent optimal rules already defined were
followed henceforth.
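
The recursion in (5)-(8) can be sketched numerically for a toy problem with K = 2 and binary covariates and treatments in which the relevant potential-outcome distributions are treated as known. This is a minimal illustration under stated assumptions, not the authors' software: the function `backward_induction`, the tables `p_s2` and `mean_y`, and all numerical values are hypothetical, and both treatments are assumed feasible for every history.

```python
from itertools import product

# Hypothetical inputs for a toy problem (K = 2, everything binary).
# p_s2[(s1, a1)] = pr{S2*(a1) = 1 | S1 = s1}
p_s2 = {(s1, a1): 0.3 + 0.2 * a1 + 0.2 * s1
        for s1, a1 in product((0, 1), repeat=2)}
# mean_y[(s1, a1, s2, a2)] = E{Y*(a1, a2) | S1 = s1, S2*(a1) = s2}
mean_y = {(s1, a1, s2, a2): s1 + 0.5 * a1 + a2 * (1 - 2 * s2)
          for s1, a1, s2, a2 in product((0, 1), repeat=4)}

def backward_induction(p_s2, mean_y):
    """Return (d1, d2, v1) for the toy problem via the recursion (5)-(8)."""
    d2, v2 = {}, {}
    # Decision K = 2: maximize the expected outcome given the full history.
    for s1, a1, s2 in product((0, 1), repeat=3):
        best = max((0, 1), key=lambda a2: mean_y[(s1, a1, s2, a2)])
        d2[(s1, a1, s2)] = best
        v2[(s1, a1, s2)] = mean_y[(s1, a1, s2, best)]
    d1, v1 = {}, {}
    # Decision 1: maximize the expectation of the decision-2 value function.
    for s1 in (0, 1):
        q = {}
        for a1 in (0, 1):
            p = p_s2[(s1, a1)]
            q[a1] = (1 - p) * v2[(s1, a1, 0)] + p * v2[(s1, a1, 1)]
        best = max((0, 1), key=lambda a1: q[a1])
        d1[s1] = best
        v1[s1] = q[best]
    return d1, d2, v1

d1, d2, v1 = backward_induction(p_s2, mean_y)
```

With these hypothetical inputs, the rule at decision 2 assigns treatment 1 exactly when $s_2 = 0$, and treatment 1 is selected at decision 1 for both baseline values.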

In Section A.1 of the Appendix, we provide a formal argument demonstrating that $d^{(1)\mathrm{opt}}$
defined in (5)-(8) is an optimal treatment regime in the sense of satisfying (4). Note that, because (4) is
true for any $s_1$, in fact $E\{Y^*(d)\} \le E\{Y^*(d^{(1)\mathrm{opt}})\}$ for any $d \in \mathcal{D}$. Thus,
from a policy perspective, $d^{(1)\mathrm{opt}}$ defines the optimal strategy for treating patients in the
population through all $K$ decisions were they to be encountered at the stage of the disease or disorder
that precedes decision point 1.

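In routine clinical practice, however, patients may be encountered at later stages. Consider a patient
$\omega \in \Omega$ for whom the first $\ell - 1$ treatment decisions have been made as seen fit by her/him
and her/his provider, $\ell = 2, \ldots, K$. Immediately prior to the $\ell$th decision, the patient would
have past history $\bar{S}_\ell^{(P)}(\omega) = \bar{s}_\ell$,
$\bar{A}_{\ell-1}^{(P)}(\omega) = \bar{a}_{\ell-1}$, raising the issue of how best to intervene to treat such
a patient henceforth, from the $\ell$th to $K$th decisions. That is, we desire rules
$d_k^{(\ell)}(\bar{s}_k, \bar{a}_{k-1})$, $k = \ell, \ell+1, \ldots, K$, say, that dictate how to treat such
patients.

Write $d^{(\ell)} = (d_\ell^{(\ell)}, d_{\ell+1}^{(\ell)}, \ldots, d_K^{(\ell)})$ to denote regimes starting
at the $\ell$th decision point. Analogous to the above, we define the class $\mathcal{D}^{(\ell)}$ of all
such feasible regimes to be the set of all $d^{(\ell)}$ for which
$d_k^{(\ell)}(\bar{s}_k, \bar{a}_{k-1}) = a_k$ for $(\bar{s}_k, \bar{a}_{k-1}) \in \Gamma_k^{(\ell)}$ and
$a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})$ for $k = \ell, \ldots, K$, where

$$\Gamma_k^{(\ell)} = \big\{(\bar{s}_k, \bar{a}_{k-1}) \in \bar{\mathcal{S}}_k \times \bar{\mathcal{A}}_{k-1} \text{ satisfying (i) } a_j \in \Psi_j(\bar{s}_j, \bar{a}_{j-1}),\ j = \ell, \ldots, k-1, \text{ and (ii) } \mathrm{pr}(\mathcal{V}_{\ell,k}) > 0\big\},$$

and $\mathcal{V}_{\ell,k}$ is the event
$\{\bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}, S_{\ell+1}^*(\bar{a}_\ell) = s_{\ell+1}, \ldots, S_k^*(\bar{a}_{k-1}) = s_k\}$.
Then, by analogy to (4), we seek $d^{(\ell)\mathrm{opt}}$ satisfying

$$E\{Y^*(\bar{a}_{\ell-1}, d^{(\ell)}) \mid \bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}\} \le E\{Y^*(\bar{a}_{\ell-1}, d^{(\ell)\mathrm{opt}}) \mid \bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}\} \tag{9}$$

for all $d^{(\ell)} \in \mathcal{D}^{(\ell)}$ and $\bar{s}_\ell \in \bar{\mathcal{S}}_\ell$,
$\bar{a}_{\ell-1} \in \bar{\mathcal{A}}_{\ell-1}$ for which
$\mathrm{pr}(\bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}) > 0$. Viewing this
as a problem of making $K - \ell + 1$ decisions at decision points $\ell, \ell+1, \ldots, K$, with initial
state $\bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}$, by an argument analogous
to that in Section A.1 of the Appendix for $\ell = 1$ and initial state $S_1 = s_1$, it may be shown that
$d^{(\ell)\mathrm{opt}}$ satisfying (9) is given by

$$d_K^{(\ell)\mathrm{opt}}(\bar{s}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \mathcal{V}_{\ell,K}\}, \tag{10}$$

$$V_K^{(\ell)}(\bar{s}_K, \bar{a}_{K-1}) = \max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} E\{Y^*(\bar{a}_{K-1}, a_K) \mid \mathcal{V}_{\ell,K}\} \tag{11}$$

for any $\bar{s}_K \in \bar{\mathcal{S}}_K$, $\bar{a}_{K-1} \in \bar{\mathcal{A}}_{K-1}$ for which
$(\bar{s}_K, \bar{a}_{K-1}) \in \Gamma_K^{(\ell)}$; and, for $k = K-1, \ldots, \ell+1$,

$$d_k^{(\ell)\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} E[V_{k+1}^{(\ell)}\{\bar{s}_k, S_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \mathcal{V}_{\ell,k}], \tag{12}$$

$$V_k^{(\ell)}(\bar{s}_k, \bar{a}_{k-1}) = \max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} E[V_{k+1}^{(\ell)}\{\bar{s}_k, S_{k+1}^*(\bar{a}_{k-1}, a_k), \bar{a}_{k-1}, a_k\} \mid \mathcal{V}_{\ell,k}] \tag{13}$$

for any $\bar{s}_k \in \bar{\mathcal{S}}_k$, $\bar{a}_{k-1} \in \bar{\mathcal{A}}_{k-1}$ for which
$(\bar{s}_k, \bar{a}_{k-1}) \in \Gamma_k^{(\ell)}$; so that

$$d_\ell^{(\ell)\mathrm{opt}}(\bar{s}_\ell, \bar{a}_{\ell-1}) = \arg\max_{a_\ell \in \Psi_\ell(\bar{s}_\ell, \bar{a}_{\ell-1})} E[V_{\ell+1}^{(\ell)}\{\bar{s}_\ell, S_{\ell+1}^*(\bar{a}_{\ell-1}, a_\ell), \bar{a}_{\ell-1}, a_\ell\} \mid \bar{S}_\ell^{(P)} = \bar{s}_\ell, \bar{A}_{\ell-1}^{(P)} = \bar{a}_{\ell-1}].$$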

Comparison of (5)-(8) to (10)-(13) shows that the $\ell$th to $K$th rules of the optimal regime
$d^{(1)\mathrm{opt}}$ that would be followed by a patient presenting at the first decision are not
necessarily the same as those of the optimal regime $d^{(\ell)\mathrm{opt}}$ that would be followed by a
patient presenting at the $\ell$th decision. In particular, noting that the conditioning sets in (5)-(8) are
$\mathcal{V}_{1,k}$, the rules are $\ell$-dependent through dependence of the conditioning sets
$\mathcal{V}_{\ell,k}$, $\ell = 1, \ldots, K$, $k = \ell, \ldots, K$, on $\ell$. However, we demonstrate
shortly that these rules coincide under certain conditions.

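The foregoing developments define optimal regimes in terms of potential outcomes. To be useful in
practice, an optimal regime must be defined in terms of the observed data. To this end, define

$$Q_K(\bar{s}_K, \bar{a}_K) = E(Y \mid \bar{S}_K = \bar{s}_K, \bar{A}_K = \bar{a}_K), \tag{14}$$

$$d_K^{\mathrm{opt}}(\bar{s}_K, \bar{a}_{K-1}) = \arg\max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} Q_K(\bar{s}_K, \bar{a}_{K-1}, a_K), \tag{15}$$

$$V_K(\bar{s}_K, \bar{a}_{K-1}) = \max_{a_K \in \Psi_K(\bar{s}_K, \bar{a}_{K-1})} Q_K(\bar{s}_K, \bar{a}_{K-1}, a_K), \tag{16}$$

for any $\bar{s}_K \in \bar{\mathcal{S}}_K$, $\bar{a}_K \in \bar{\mathcal{A}}_K$ for which
$\mathrm{pr}(\bar{S}_K = \bar{s}_K, \bar{A}_{K-1} = \bar{a}_{K-1}) > 0$; and for $k = K-1, \ldots, 1$,

$$Q_k(\bar{s}_k, \bar{a}_k) = E\{V_{k+1}(\bar{s}_k, S_{k+1}, \bar{a}_k) \mid \bar{S}_k = \bar{s}_k, \bar{A}_k = \bar{a}_k\}, \tag{17}$$

$$d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = \arg\max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} Q_k(\bar{s}_k, \bar{a}_{k-1}, a_k), \tag{18}$$

$$V_k(\bar{s}_k, \bar{a}_{k-1}) = \max_{a_k \in \Psi_k(\bar{s}_k, \bar{a}_{k-1})} Q_k(\bar{s}_k, \bar{a}_{k-1}, a_k), \tag{19}$$

for any $\bar{s}_k \in \bar{\mathcal{S}}_k$, $\bar{a}_k \in \bar{\mathcal{A}}_k$ for which
$\mathrm{pr}(\bar{S}_k = \bar{s}_k, \bar{A}_{k-1} = \bar{a}_{k-1}) > 0$. Note that all quantities in (14)-(19)
are expressed entirely in terms of the distribution of the observed data.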

In Section A.2 of the Appendix, under the consistency and sequential randomization (no unmeasured
confounders) assumptions, along with positivity assumptions given there on probabilities associated with
events involving $\bar{S}_K$, $\bar{A}_K$ and $\bar{S}_K^{(P)}$, $\bar{A}_K^{(P)}$, we show that

$$\Gamma_k^{(\ell)} = \Gamma_k, \tag{20}$$

$$d_k^{(\ell)\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}), \tag{21}$$

$$V_k^{(\ell)}(\bar{s}_k, \bar{a}_{k-1}) = V_k(\bar{s}_k, \bar{a}_{k-1}), \tag{22}$$

for $(\bar{s}_k, \bar{a}_{k-1}) \in \Gamma_k$ for $\ell = 1, \ldots, K$ and $k = \ell, \ldots, K$. The
equivalence in (20)-(22) demonstrates not only that an optimal treatment regime can be obtained using the
distribution of the observed data but also that the corresponding rules dictating treatment do not depend
on $\ell$ under these assumptions. Thus, (20)-(22) imply that the single set of rules
$d^{\mathrm{opt}} = (d_1^{\mathrm{opt}}, \ldots, d_K^{\mathrm{opt}})$ defined in (15) and (18) is relevant
regardless of when a patient presents. That is, treatment at the $\ell$th decision point for a patient who
presents at decision 1 and has followed the rules in $d^{\mathrm{opt}}$ to that point would be determined
by $d_\ell^{\mathrm{opt}}$ evaluated at his/her history up to that point, as would treatment for a subject
presenting for the first time immediately prior to decision $\ell$.

The $Q_k(\bar{s}_k, \bar{a}_k)$ in (14) and (17) are referred to as the "Q-functions," viewed as measuring
the "quality" associated with using treatment $a_k$ at decision $k$ given the history up to that decision
and then following the optimal regime thereafter. The "value functions" $V_k(\bar{s}_k, \bar{a}_{k-1})$ in
(16) and (19) reflect the "value" of a patient's history $\bar{s}_k, \bar{a}_{k-1}$ assuming that optimal
decisions are made in the future.

It is worth noting that there may not be a unique $d^{\mathrm{opt}}$. At any decision point $k$, if there is
more than one feasible treatment option leading to the maximum value of the Q-function, then any rule
$d_k^{\mathrm{opt}}$ yielding one of these defines an optimal regime.
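
Non-uniqueness can be made concrete with a small Python sketch that returns every feasible treatment attaining the maximum of a Q-function at one fixed history. The helper `optimal_actions`, the numerical tolerance `tol`, and the Q-values shown are hypothetical and serve only to illustrate the point.

```python
def optimal_actions(q_values, tol=1e-12):
    """Return all feasible treatments whose Q-value is within tol of the maximum.

    q_values maps each feasible treatment a_k to Q_k(history, a_k)
    at a single fixed history.
    """
    best = max(q_values.values())
    return sorted(a for a, q in q_values.items() if best - q <= tol)

# Hypothetical Q-values for three feasible options at one history:
# options 0 and 1 tie, so either defines an optimal rule there.
tied = optimal_actions({0: 1.0, 1: 1.0, 2: 0.5})
```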

4 Q- and A-Learning

4.1 Q-Learning 

From (15) and (18), the optimal regime $d^{\mathrm{opt}}$ is defined in terms of the Q-functions (14), (17).
Thus, estimation of $d^{\mathrm{opt}}$ based on i.i.d. data $(S_{1i}, A_{1i}, \ldots, S_{Ki}, A_{Ki}, Y_i)$,
$i = 1, \ldots, n$, may be accomplished via direct modeling and fitting of the Q-functions. This is the
approach underlying Q-learning. Specifically, one may posit models $Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$, say,
for $k = K, K-1, \ldots, 1$, each depending on a finite-dimensional parameter $\xi_k$. The models may be
linear or nonlinear in $\xi_k$ and include main effects and interactions in the elements of $\bar{s}_k$ and
$\bar{a}_k$.

Estimators $\hat{\xi}_k$ may be obtained in a backward iterative fashion for $k = K, K-1, \ldots, 1$ by
solving suitable estimating equations [e.g., ordinary (OLS) or weighted (WLS) least squares]. Assuming the
latter, for $k = K$, letting $\tilde{V}_{(K+1)i} = Y_i$, one would first solve

$$\sum_{i=1}^n \frac{\partial Q_K(\bar{S}_{Ki}, \bar{A}_{Ki}; \xi_K)}{\partial \xi_K} \Sigma_K^{-1}(\bar{S}_{Ki}, \bar{A}_{Ki}) \{\tilde{V}_{(K+1)i} - Q_K(\bar{S}_{Ki}, \bar{A}_{Ki}; \xi_K)\} = 0 \tag{23}$$

in $\xi_K$ to obtain $\hat{\xi}_K$, where $\Sigma_K(\bar{s}_K, \bar{a}_K)$ is a working variance model.
Substituting the model $Q_K(\bar{s}_K, \bar{a}_K; \xi_K)$ in (15) and accordingly writing
$d_K^{\mathrm{opt}}(\bar{s}_K, \bar{a}_{K-1}; \xi_K)$, substituting $\hat{\xi}_K$ for $\xi_K$ yields an
estimator for the optimal treatment choice at decision $K$ for a patient with past history
$\bar{S}_K = \bar{s}_K$, $\bar{A}_{K-1} = \bar{a}_{K-1}$. With $\hat{\xi}_K$ in hand, one would form for each
$i$, based on (16),
$\tilde{V}_{Ki} = \max_{a_K \in \Psi_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i})} Q_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}, a_K; \hat{\xi}_K)$.
To obtain $\hat{\xi}_{K-1}$, setting $k = K-1$, based on (17), one would then solve in $\xi_k$

$$\sum_{i=1}^n \frac{\partial Q_k(\bar{S}_{ki}, \bar{A}_{ki}; \xi_k)}{\partial \xi_k} \Sigma_k^{-1}(\bar{S}_{ki}, \bar{A}_{ki}) \{\tilde{V}_{(k+1)i} - Q_k(\bar{S}_{ki}, \bar{A}_{ki}; \xi_k)\} = 0, \tag{24}$$

where $\Sigma_k(\bar{s}_k, \bar{a}_k)$ is a working variance model. The corresponding
$d_{K-1}^{\mathrm{opt}}(\bar{s}_{K-1}, \bar{a}_{K-2}; \hat{\xi}_{K-1})$ then yields an estimator for the
optimal treatment choice at decision $K-1$ for a patient with past history
$\bar{S}_{K-1} = \bar{s}_{K-1}$, $\bar{A}_{K-2} = \bar{a}_{K-2}$, assuming s/he will take the optimal
treatment at decision $K$. One would continue this process in the obvious fashion for
$k = K-2, \ldots, 1$, forming
$\tilde{V}_{ki} = \max_{a_k \in \Psi_k(\bar{S}_{ki}, \bar{A}_{(k-1)i})} Q_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}, a_k; \hat{\xi}_k)$,
and solving equations of form (24) to obtain $\hat{\xi}_k$ and corresponding
$d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}; \hat{\xi}_k)$.

We may now summarize the estimated optimal regime as
$\hat{d}_Q^{\mathrm{opt}} = (\hat{d}_{Q,1}^{\mathrm{opt}}, \ldots, \hat{d}_{Q,K}^{\mathrm{opt}})$, where

$$\hat{d}_{Q,1}^{\mathrm{opt}}(s_1) = d_1^{\mathrm{opt}}(s_1; \hat{\xi}_1), \quad \hat{d}_{Q,k}^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}; \hat{\xi}_k), \quad k = 2, \ldots, K. \tag{25}$$

It is important to recognize that the estimated regime (25) may not be a credible estimator for the true
optimal regime unless all the models for the Q-functions are correctly specified.
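
The backward fitting scheme can be sketched in a few lines of Python for K = 2 when all variables are discrete. As a simplifying assumption, the sketch uses saturated models, so that each OLS fit of the form (23)-(24) reduces to within-cell sample means. The function names, the record layout `(s1, a1, s2, a2, y)`, and the noise-free toy data are hypothetical, and both treatments are assumed feasible for every history.

```python
from collections import defaultdict
from itertools import product

def cell_means(rows, key, value):
    """Sample mean of value(row) within each cell key(row) (a saturated OLS fit)."""
    total, count = defaultdict(float), defaultdict(int)
    for r in rows:
        total[key(r)] += value(r)
        count[key(r)] += 1
    return {k: total[k] / count[k] for k in total}

def q_learning_two_stage(rows):
    """Tabular Q-learning for records (s1, a1, s2, a2, y), treatments coded 0/1."""
    rows = [tuple(r) for r in rows]
    # Decision 2: estimate Q2 by regressing Y on the full history.
    q2 = cell_means(rows, key=lambda r: r[:4], value=lambda r: r[4])

    def v2(s1, a1, s2):
        # Pseudo-outcome: maximum fitted Q2 over treatments observed in the cell.
        return max(q2[(s1, a1, s2, a2)] for a2 in (0, 1)
                   if (s1, a1, s2, a2) in q2)

    # Decision 1: regress the pseudo-outcome on (s1, a1).
    q1 = cell_means(rows, key=lambda r: r[:2], value=lambda r: v2(*r[:3]))

    d2 = {}
    for s1, a1, s2 in {r[:3] for r in rows}:
        d2[(s1, a1, s2)] = max(
            (a2 for a2 in (0, 1) if (s1, a1, s2, a2) in q2),
            key=lambda a2: q2[(s1, a1, s2, a2)])
    d1 = {}
    for s1 in {r[0] for r in rows}:
        d1[s1] = max((a1 for a1 in (0, 1) if (s1, a1) in q1),
                     key=lambda a1: q1[(s1, a1)])
    return d1, d2

# Toy, noise-free data: one record per cell of (s1, a1, s2, a2).
toy = [(s1, a1, s2, a2, s1 + 0.5 * a1 + a2 * (1 - 2 * s2))
       for s1, a1, s2, a2 in product((0, 1), repeat=4)]
d1_hat, d2_hat = q_learning_two_stage(toy)
```

Saturated fits are feasible only when every history cell is well populated; with continuous or high-dimensional covariates one would instead posit parametric models for the Q-functions as described above.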

We illustrate for the case $K = 2$, where at each decision there are two feasible treatment options coded
as 0 and 1; i.e., $\Psi_1(s_1) = \mathcal{A}_1 = \{0, 1\}$ for all $s_1$ and
$\Psi_2(\bar{s}_2, a_1) = \mathcal{A}_2 = \{0, 1\}$ for all $\bar{s}_2$ and $a_1 \in \{0, 1\}$. Let
$\mathcal{H}_1 = (1, s_1^T)^T$ and $\mathcal{H}_2 = (1, s_1^T, a_1, s_2^T)^T$. As in many modeling contexts,
it is standard to adopt linear models for the Q-functions; accordingly, consider the models

$$Q_1(s_1, a_1; \xi_1) = \mathcal{H}_1^T \beta_1 + a_1(\mathcal{H}_1^T \psi_1), \quad Q_2(\bar{s}_2, \bar{a}_2; \xi_2) = \mathcal{H}_2^T \beta_2 + a_2(\mathcal{H}_2^T \psi_2), \tag{26}$$

where $\xi_k = (\beta_k^T, \psi_k^T)^T$ for $k = 1, 2$. Note that $Q_2(\bar{s}_2, \bar{a}_2; \xi_2)$ in (26) is
a model for $E(Y \mid \bar{S}_2 = \bar{s}_2, \bar{A}_2 = \bar{a}_2)$, which is a standard regression problem
involving observable data, whereas $Q_1(s_1, a_1; \xi_1)$ is a model for the conditional expectation of
$V_2(\bar{s}_2, a_1) = \max_{a_2 \in \{0,1\}} E(Y \mid \bar{S}_2 = \bar{s}_2, A_1 = a_1, A_2 = a_2)$ given
$S_1 = s_1$ and $A_1 = a_1$, which is an approximation to a complex relationship involving a maximization.
Under (26), it is straightforward to deduce that
$V_2(\bar{s}_2, a_1; \xi_2) = \max_{a_2 \in \{0,1\}} Q_2(\bar{s}_2, a_1, a_2; \xi_2) = \mathcal{H}_2^T \beta_2 + (\mathcal{H}_2^T \psi_2) I(\mathcal{H}_2^T \psi_2 > 0)$
and
$V_1(s_1; \xi_1) = \max_{a_1 \in \{0,1\}} Q_1(s_1, a_1; \xi_1) = \mathcal{H}_1^T \beta_1 + (\mathcal{H}_1^T \psi_1) I(\mathcal{H}_1^T \psi_1 > 0)$.
Substituting the Q-functions in (26) in (15) and (18) then yields
$d_1^{\mathrm{opt}}(s_1; \xi_1) = I(\mathcal{H}_1^T \psi_1 > 0)$ and
$d_2^{\mathrm{opt}}(\bar{s}_2, a_1; \xi_2) = I(\mathcal{H}_2^T \psi_2 > 0)$.
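
Under working models of the form (26), the decision-2 rule and value follow directly from the linear contrast. The Python sketch below assumes scalar $s_1$ and $s_2$ and uses hypothetical coefficient values; `dot` and `stage2_rule_and_value` are illustrative helpers, not part of any published software.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))

def stage2_rule_and_value(h2, beta2, psi2):
    """Return (rule, value) under Q2 = h2'beta2 + a2 * (h2'psi2), a2 in {0, 1}."""
    contrast = dot(h2, psi2)            # Q2 at a2 = 1 minus Q2 at a2 = 0
    a2_opt = int(contrast > 0)          # rule I(h2'psi2 > 0)
    value = dot(h2, beta2) + contrast * a2_opt
    return a2_opt, value

# Hypothetical history and coefficients, with h2 = (1, s1, a1, s2).
h2 = (1.0, 0.4, 1.0, -1.2)
beta2 = (1.0, 0.5, 0.2, 0.3)
psi2 = (0.5, 0.0, -0.1, 1.0)
a2_opt, v2 = stage2_rule_and_value(h2, beta2, psi2)

# Brute-force check against direct maximization over a2 in {0, 1}.
brute = max(dot(h2, beta2) + a2 * dot(h2, psi2) for a2 in (0, 1))
```

The closed-form value agrees with the brute-force maximum, which is the content of the expression for $V_2$ above; the decision-1 quantities are obtained in the same way from $\mathcal{H}_1$, $\beta_1$, and $\psi_1$.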

We have presented (23) and (24) in the conventional WLS form, with leading term in the summand
$\partial/\partial \xi_k \, Q_k(\bar{S}_{ki}, \bar{A}_{ki}; \xi_k) \Sigma_k^{-1}(\bar{S}_{ki}, \bar{A}_{ki})$;
taking $\Sigma_k$ to be a constant yields OLS. At the $K$th decision, with responses $Y_i$, standard theory
implies that this is the optimal leading term when
$\mathrm{var}(Y \mid \bar{S}_K = \bar{s}_K, \bar{A}_K = \bar{a}_K) = \Sigma_K(\bar{s}_K, \bar{a}_K)$,
yielding the efficient (asymptotically) estimator for $\xi_K$. For $k < K$, with "responses"
$\tilde{V}_{(k+1)i}$, this theory may no longer apply; however, deriving the optimal leading term involves
considerable complication. Accordingly, it is standard to fit the posited models
$Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$ via OLS or WLS; some authors define Q-learning as using OLS (Chakraborty,
Murphy, and Strecher, 2010). The choice may be dictated by apparent relevance of the homoscedasticity
assumption on the $\tilde{V}_{(k+1)i}$, $k = K, K-1, \ldots, 1$, and whether or not linear models are
sufficient to approximate the relationships may also be evaluated, but see Section 4.3.

4.2 A-Learning 

Advantage learning (A-learning, Murphy, 2003) is an alternative to Q-learning that involves making fewer
assumptions on the form of the Q-functions. For simplicity, we consider the case of two feasible treatment
options coded as 0 and 1 at each decision; i.e.,
$\Psi_k(\bar{s}_k, \bar{a}_{k-1}) = \mathcal{A}_k = \{0, 1\}$, $k = 1, \ldots, K$, though the
methodology can be extended to an arbitrary number of treatments at each stage at the expense of
complicating the formulation and notation.

To fix ideas, consider (26). Note that $d_1^{\mathrm{opt}}(s_1; \xi_1)$ implied by (26) depends only on
$\mathcal{H}_1^T \psi_1 = Q_1(s_1, 1; \xi_1) - Q_1(s_1, 0; \xi_1)$; likewise, $d_2^{\mathrm{opt}}(\bar{s}_2, a_1; \xi_2)$ depends on
$\mathcal{H}_2^T \psi_2 = Q_2(\bar{s}_2, a_1, 1; \xi_2) - Q_2(\bar{s}_2, a_1, 0; \xi_2)$. This is a
special case of the general result that, for purposes of deducing the optimal regime, for each $k = 1, \ldots, K$,
it suffices to know the contrast function $C_k(\bar{s}_k, \bar{a}_{k-1}) = Q_k(\bar{s}_k, \bar{a}_{k-1}, 1) - Q_k(\bar{s}_k, \bar{a}_{k-1}, 0)$. This can be
appreciated by noting that any arbitrary $Q_k(\bar{s}_k, \bar{a}_k)$ may be written as $h_k(\bar{s}_k, \bar{a}_{k-1}) + a_k C_k(\bar{s}_k, \bar{a}_{k-1})$, where
$h_k(\bar{s}_k, \bar{a}_{k-1}) = Q_k(\bar{s}_k, \bar{a}_{k-1}, 0)$, so that $Q_k(\bar{s}_k, \bar{a}_{k-1}, a_k)$ is maximized by taking $a_k = I\{C_k(\bar{s}_k, \bar{a}_{k-1}) > 0\}$;
and the maximum itself is the expression $h_k(\bar{s}_k, \bar{a}_{k-1}) + C_k(\bar{s}_k, \bar{a}_{k-1}) I\{C_k(\bar{s}_k, \bar{a}_{k-1}) > 0\}$.

The premise of A-learning is thus to model the contrast functions rather than the full Q-functions as in
Q-learning. For $k = K-1, \ldots, 1$ the latter involve possibly complex relationships, raising concern over the
consequences of model misspecification for estimation of the optimal regime. As identifying the optimal
regime depends only on correct specification of the contrast functions, A-learning may be less sensitive to
mismodeling.
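
The decomposition of a Q-function into $h_k$ and the contrast $C_k$ can be checked directly. The short sketch below uses a hypothetical, deliberately nonlinear Q-function of a scalar history (`q_example` is an assumption made purely for illustration) and confirms that the rule $I(C > 0)$ attains the maximum over the two treatment options.

```python
def contrast_decomposition(q, history):
    """Return h = Q(history, 0) and C = Q(history, 1) - Q(history, 0)."""
    h = q(history, 0)
    c = q(history, 1) - q(history, 0)
    return h, c

def optimal_action(c):
    """Treatment maximizing h + a * C over a in {0, 1}, i.e., a = I(C > 0)."""
    return 1 if c > 0 else 0

def q_example(s, a):
    """Hypothetical Q-function: nonlinear baseline part, linear contrast."""
    return s ** 2 - 1.0 + a * (0.5 - s)

results = []
for s in (-1.0, 0.0, 0.5, 2.0):
    h, c = contrast_decomposition(q_example, s)
    a_opt = optimal_action(c)
    v_max = h + c * a_opt          # equals h + C * I(C > 0)
    results.append((s, h, c, a_opt, v_max))
```

Only the sign of the contrast matters for the decision; the baseline part `h` can be arbitrarily complicated without changing the rule.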

We now describe the A-learning procedure. Assume posited models $C_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k)$, $k = 1, \ldots, K$,
say, for the contrast functions, each depending on a parameter $\psi_k$. Consider the $K$th decision. Given
$C_K(\bar{s}_K, \bar{a}_{K-1}; \psi_K)$, letting $\pi_K(\bar{s}_K, \bar{a}_{K-1}) = \mathrm{pr}(A_K = 1 \mid \bar{S}_K = \bar{s}_K, \bar{A}_{K-1} = \bar{a}_{K-1})$ be the propensity of
receiving treatment 1 in the observed data as a function of past history and writing $\tilde{V}_{(K+1)i} = Y_i$, Robins
(2004) showed that all consistent and asymptotically normal estimators for $\psi_K$ are solutions to estimating
equations of the form

$$\sum_{i=1}^{n} \lambda_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}) \{A_{Ki} - \pi_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i})\} \times \{\tilde{V}_{(K+1)i} - A_{Ki} C_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}; \psi_K) - \theta_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i})\} = 0 \tag{27}$$

for arbitrary functions $\lambda_K(\bar{s}_K, \bar{a}_{K-1})$ of the same dimension as $\psi_K$ and $\theta_K(\bar{s}_K, \bar{a}_{K-1})$. Assuming the
model $C_K(\bar{s}_K, \bar{a}_{K-1}; \psi_K)$ is correct, if $\mathrm{var}(Y \mid \bar{S}_K = \bar{s}_K, \bar{A}_{K-1} = \bar{a}_{K-1})$ is constant, the optimal choices of
these functions are $\lambda_K(\bar{s}_K, \bar{a}_{K-1}; \psi_K) = \partial/\partial \psi_K \, C_K(\bar{s}_K, \bar{a}_{K-1}; \psi_K)$ and $\theta_K(\bar{s}_K, \bar{a}_{K-1}) = h_K(\bar{s}_K, \bar{a}_{K-1})$;
otherwise, the optimal $\lambda_K$ is complex (Robins, 2004).

To implement estimation of $\psi_K$ via (27), one may adopt parametric models for these functions. Although
the appeal of A-learning is that it obviates the need to specify fully the Q-functions, one may posit
a model for the optimal $\theta_K$, $h_K(\bar{s}_K, \bar{a}_{K-1}; \beta_K)$, say. Moreover, unless the data are from a SMART study, in
which case the propensities $\pi_K(\bar{s}_K, \bar{a}_{K-1})$ would be known, these may also be modeled as $\pi_K(\bar{s}_K, \bar{a}_{K-1}; \phi_K)$
(e.g., by a logistic regression). These models are only adjuncts to estimating the parameter of interest,
$\psi_K$; interestingly, as long as at least one of these models is correctly specified, (27) will yield a consistent
estimator for $\psi_K$, the so-called double robustness property. Substituting these models in (27), one solves
(27) jointly in $(\phi_K^T, \beta_K^T, \psi_K^T)^T$ with

$$\sum_{i=1}^{n} \frac{\partial}{\partial \beta_K} h_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}; \beta_K) \{\tilde{V}_{(K+1)i} - A_{Ki} C_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}; \psi_K) - h_K(\bar{S}_{Ki}, \bar{A}_{(K-1)i}; \beta_K)\} = 0$$

and the usual binary regression likelihood score equations in $\phi_K$. We then have
$d_K^{\mathrm{opt}}(\bar{s}_K, \bar{a}_{K-1}; \psi_K) = I\{C_K(\bar{s}_K, \bar{a}_{K-1}; \psi_K) > 0\}$; as in Q-learning, substituting $\hat{\psi}_K$ yields an estimator for the optimal treatment
choice at decision $K$ for a patient with past history $\bar{S}_K = \bar{s}_K$, $\bar{A}_{K-1} = \bar{a}_{K-1}$.
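
For linear working models the estimating equations at a single decision reduce to a linear system once the propensity has been fitted. The sketch below is one possible implementation under stated assumptions: the contrast and baseline models are both linear in the same design matrix `H`, the leading function $\lambda$ is taken to be the derivative of the contrast with respect to $\psi$ (that is, `H`), and the propensity is a logistic regression fitted by Newton-Raphson. All function names and the noise-free toy data are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, a, n_iter=25):
    """Maximum likelihood for pr(A = 1 | X) = expit(X phi) via Newton-Raphson."""
    phi = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ phi)
        score = X.T @ (a - p)
        info = (X * (p * (1 - p))[:, None]).T @ X
        phi = phi + np.linalg.solve(info, score)
    return phi

def a_learning_one_stage(H, a, v, X_prop):
    """Solve the A-learning equations with C = H psi, h = H beta, lambda = H.

    H: design for contrast and baseline models; a: 0/1 treatments;
    v: responses (Y at the last decision); X_prop: propensity design.
    Returns (psi_hat, beta_hat).
    """
    pi_hat = expit(X_prop @ fit_logistic(X_prop, a))
    W = np.hstack([a[:, None] * H, H])                 # columns for (psi, beta)
    G = np.vstack([(H * (a - pi_hat)[:, None]).T,      # rows of the psi equation
                   H.T])                               # rows of the beta equation
    theta = np.linalg.solve(G @ W, G @ v)              # G (v - W theta) = 0
    p = H.shape[1]
    return theta[:p], theta[p:]

# Toy data (assumed): correct contrast and baseline models, no outcome noise.
rng = np.random.default_rng(1)
n = 1000
s = rng.normal(size=n)
a = rng.binomial(1, expit(0.3 - 0.8 * s)).astype(float)
H = np.column_stack([np.ones(n), s])
true_psi = np.array([1.0, 0.5])
true_beta = np.array([1.0, 1.0])
y = H @ true_beta + a * (H @ true_psi)

psi_hat, beta_hat = a_learning_one_stage(H, a, y, H)
d_hat = (H @ psi_hat > 0).astype(int)   # estimated rule I(C > 0)
```

Because the toy outcome contains no noise and both outcome-side models are correct, the residual vanishes at the true parameter values and the solver returns them up to rounding error.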

With $\hat{\psi}_K$ in hand, as with Q-learning, the A-learning algorithm proceeds in a backward iterative fashion
to yield $\hat{\psi}_k$, $k = K-1, \ldots, 1$. At the $k$th decision, given models $h_k(\bar{s}_k, \bar{a}_{k-1}; \beta_k)$ and $\pi_k(\bar{s}_k, \bar{a}_{k-1}; \phi_k)$, one
solves jointly in $(\phi_k^T, \beta_k^T, \psi_k^T)^T$ a system of estimating equations analogous to those above. As in Q-learning,
the $k$th set of equations is based on "responses" $\tilde{V}_{(k+1)i}$, where, for each $i$, $\tilde{V}_{ki}$ estimates $V_k(\bar{S}_{ki}, \bar{A}_{(k-1)i})$. It
may be shown (see Section A.3 of the Appendix) that

$$E\big( V_{k+1}(\bar{S}_{k+1}, \bar{A}_k) + C_k(\bar{S}_k, \bar{A}_{k-1}) [ I\{C_k(\bar{S}_k, \bar{A}_{k-1}) > 0\} - A_k ] \,\big|\, \bar{S}_k, \bar{A}_{k-1} \big) = V_k(\bar{S}_k, \bar{A}_{k-1}).$$

The expression $C_k(\bar{S}_k, \bar{A}_{k-1}) [ I\{C_k(\bar{S}_k, \bar{A}_{k-1}) > 0\} - A_k ]$ is referred to as the advantage
or regret function (Murphy, 2003), as it represents the "advantage" in response incurred if the optimal
treatment at the $k$th decision were given relative to that actually received (or, equivalently, the "regret"
incurred by not using the optimal treatment). Accordingly, define recursively
$\tilde{V}_{ki} = \tilde{V}_{(k+1)i} + C_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \hat{\psi}_k) [ I\{C_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \hat{\psi}_k) > 0\} - A_{ki} ]$,
$k = K, K-1, \ldots, 1$, $\tilde{V}_{(K+1)i} = Y_i$. The equations at the $k$th decision are then

$$\sum_{i=1}^{n} \lambda_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \psi_k) \{A_{ki} - \pi_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \phi_k)\} \times \{\tilde{V}_{(k+1)i} - A_{ki} C_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \psi_k) - h_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \beta_k)\} = 0, \tag{28}$$

$$\sum_{i=1}^{n} \frac{\partial}{\partial \beta_k} h_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \beta_k) \{\tilde{V}_{(k+1)i} - A_{ki} C_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \psi_k) - h_k(\bar{S}_{ki}, \bar{A}_{(k-1)i}; \beta_k)\} = 0,$$

for a given specification $\lambda_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k)$, solved jointly with the maximum likelihood score equations for
binary regression in $\phi_k$. It follows that $d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}; \psi_k) = I\{C_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k) > 0\}$. As above, the optimal
$\lambda_k$ is complex (Robins, 2004); taking $\lambda_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k) = \partial/\partial \psi_k \, C_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k)$ is reasonable for practical
implementation.

Summarizing, the estimated optimal regime $\hat{d}_A^{\mathrm{opt}} = (\hat{d}_{A,1}^{\mathrm{opt}}, \ldots, \hat{d}_{A,K}^{\mathrm{opt}})$ is

$$\hat{d}_{A,1}^{\mathrm{opt}}(s_1) = d_1^{\mathrm{opt}}(s_1; \hat{\psi}_1), \qquad \hat{d}_{A,k}^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}) = d_k^{\mathrm{opt}}(\bar{s}_k, \bar{a}_{k-1}; \hat{\psi}_k), \quad k = 2, \ldots, K. \tag{29}$$

As with Q-learning, how well $\hat{d}_A^{\mathrm{opt}}$ estimates $d^{\mathrm{opt}}$ depends on how close the $C_k(\bar{s}_k, \bar{a}_{k-1}; \psi_k)$ are to the true
contrast functions.
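
The recursive construction of the A-learning "responses" can be sketched for a single patient as follows. The sketch assumes the fitted contrast values at each decision are already available; in the actual algorithm the contrast at decision $k$ is estimated from the response for decision $k+1$ before the response for decision $k$ is formed. The function names and the numerical values are hypothetical.

```python
def advantage(c, a):
    """C * [I(C > 0) - a]: gain from the optimal action relative to action a."""
    return c * ((1 if c > 0 else 0) - a)

def backward_pseudo_outcomes(y, contrasts, actions):
    """Return [V_1, ..., V_K, V_{K+1}] for one patient, with V_{K+1} = y.

    contrasts[k-1] is the fitted contrast and actions[k-1] the observed 0/1
    treatment at decision k, for k = 1, ..., K.
    """
    K = len(contrasts)
    v = [None] * K + [y]                 # v[K] holds V_{K+1} = y
    for k in range(K - 1, -1, -1):       # decisions K, K-1, ..., 1
        v[k] = v[k + 1] + advantage(contrasts[k], actions[k])
    return v

# Patient treated suboptimally at both decisions (hypothetical values).
v_sub = backward_pseudo_outcomes(10.0, contrasts=[2.0, -1.5], actions=[0, 1])
# Patient treated optimally at both decisions: no adjustment is made.
v_opt = backward_pseudo_outcomes(10.0, contrasts=[2.0, -1.5], actions=[1, 0])
```

The adjustment is never negative, so the pseudo-outcome can only be raised above the observed response, reflecting what would have been expected had the optimal treatment been given from that decision onward.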

4.3 Comparison and Practical Considerations 

When $K = 1$, the Q-function is a model for $E(Y \mid S_1 = s_1, A_1 = a_1)$. If in Q-learning this model and
the variance model $\Sigma_1$ in (23) are correctly specified, then, as noted above, the form of (23) is optimal
for estimating $\xi_1$. Accordingly, even if $C_1(s_1; \psi_1)$ and $h_1(s_1; \beta_1)$ are correctly modeled, (28) with $K = 1$
is generally not of this optimal form for any choice $\lambda_1(s_1; \psi_1)$, and hence A-learning will yield relatively
inefficient inference on $\psi_1$ and hence on the optimal regime. However, if in Q-learning the Q-function is
mismodeled, but in A-learning $C_1(s_1; \psi_1)$ and $\pi_1(s_1; \phi_1)$ are both correctly specified, then A-learning will
still yield consistent inference on $\psi_1$ and hence the optimal regime, whereas inference on $\xi_1$ and the optimal
regime via Q-learning may be inconsistent. We assess the trade-off between consistency and efficiency in
this case in Section 5. For $K > 1$, owing to the complications involved in specifying optimal estimating
equations for Q- and A-learning, the relative performance of the methods is not readily apparent; we
investigate in Section 5.

In certain special cases, Q- and A-learning lead to identical estimators for the Q-function (Chakraborty
et al., 2010). For example, this holds if the propensities for treatment are constant, as would be the case
under pure randomization at each decision point, and certain linear models are used for $C_1(s_1; \psi_1)$ and
$h_1(s_1; \beta_1)$; see Section A.4 of the Appendix for a demonstration when $K = 1$ and $\mathrm{pr}(A_1 = 1 \mid S_1 = s_1)$ does
not depend on $s_1$.
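
The equivalence under constant propensity can also be seen numerically. The sketch below is an illustration under assumptions, not the Appendix argument: it takes $K = 1$, linear models for the contrast and baseline functions in the same design `H`, the leading function $\lambda = $ `H`, and a constant propensity estimated by the sample proportion treated. With these choices the A-learning equations are a nonsingular linear combination of the OLS normal equations, so the two fits coincide even though the baseline model here is deliberately misspecified. The simulated data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
s = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n).astype(float)        # pure randomization
y = 1.0 + s + s ** 2 + a * (1.0 + 0.5 * s) + rng.normal(scale=3.0, size=n)

H = np.column_stack([np.ones(n), s])
W = np.hstack([H, a[:, None] * H])                     # columns for (beta, psi)

# Q-learning: OLS of Y on (H, a * H).
xi_q = np.linalg.lstsq(W, y, rcond=None)[0]

# A-learning with constant propensity estimated by the proportion treated:
#   sum H (Y - W xi) = 0  and  sum H (A - pi_hat) (Y - W xi) = 0.
pi_hat = a.mean()
G = np.vstack([H.T, (H * (a - pi_hat)[:, None]).T])
xi_a = np.linalg.solve(G @ W, G @ y)
```

The last two entries of `xi_q` and `xi_a` are the contrast parameters, and they agree to numerical precision.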

As we have emphasized, for Q-learning, while modeling the Q-function at decision $K$ is a standard
regression problem with response $Y$, for decisions $K-1, \ldots, 1$, this involves modeling the estimated value
function, which depends on the relationships for subsequent decisions. Ideally, the sequence of posited
models $Q_k(\bar{s}_k, \bar{a}_k; \xi_k)$ should respect this constraint. However, this may be difficult to achieve with standard
regression models. To illustrate, consider the models in (26), and assume $S_1, S_2$ are scalar, where the
conditional distribution of $S_2$ given $S_1 = s_1, A_1 = a_1$ is $\mathrm{Normal}(\mathcal{K}_1^T \gamma, \sigma^2)$, say, $\mathcal{K}_1 = (1, s_1, a_1)^T$. Recall
that $V_2(\bar{s}_2, a_1; \xi_2) = \mathcal{H}_2^T \beta_2 + (\mathcal{H}_2^T \psi_2) I(\mathcal{H}_2^T \psi_2 > 0)$, where we can write $\mathcal{H}_2^T \beta_2 = \mathcal{K}_1^T \beta_{21} + s_2 \beta_{22}$ and
$\mathcal{H}_2^T \psi_2 = \mathcal{K}_1^T \psi_{21} + s_2 \psi_{22}$. Then, if the model $Q_2$ in (26) were correct, from (17), ideally,
$Q_1(s_1, a_1) = E\{V_2(s_1, S_2, a_1; \xi_2) \mid S_1 = s_1, A_1 = a_1\}$. Letting $\varphi(\cdot)$ and $\Phi(\cdot)$ be the standard normal density and cumulative
distribution function, respectively, it may be shown (see Section A.5 of the Appendix) that, under these
conditions,

$$Q_1(s_1, a_1) = E\{V_2(s_1, S_2, a_1; \xi_2) \mid S_1 = s_1, A_1 = a_1\} = \mathcal{K}_1^T(\beta_{21} + \gamma \beta_{22}) + (\mathcal{K}_1^T \psi_{21})\{1 - \Phi(\eta)\} + \psi_{22}\big[\sigma \varphi(\eta) + (\mathcal{K}_1^T \gamma)\{1 - \Phi(\eta)\}\big], \tag{30}$$

where $\eta = -\mathcal{K}_1^T(\psi_{21}/\psi_{22} + \gamma)/\sigma$, and we have taken $\psi_{22} > 0$. Contrast the implied true $Q_1(s_1, a_1)$ in (30)
to the posited linear model in (26); clearly, the true relationship is highly nonlinear in $s_1, a_1$ and is likely
to be poorly approximated by $Q_1(s_1, a_1; \xi_1)$ in (26). Evidently, for larger $K$, this incompatibility between
true and assumed models would propagate from $K-1, \ldots, 1$. Thus, while the use of linear models for the
Q-functions is popular in practice, the potential for such mismodeling should be recognized.
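
The closed form in (30) can be checked numerically against direct integration of the second-stage value function over the normal distribution of $S_2$. The sketch below does this with the trapezoidal rule; the parameter values are hypothetical and chosen only for illustration, and $\psi_{22} > 0$ is assumed as in the text.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def q1_closed_form(k1, beta21, beta22, psi21, psi22, gamma, sigma):
    """Right-hand side of (30); k1 = (1, s1, a1), psi22 > 0 assumed."""
    mean_s2 = dot(k1, gamma)
    eta = -(dot(k1, psi21) / psi22 + mean_s2) / sigma
    upper = 1.0 - norm_cdf(eta)
    return (dot(k1, beta21) + mean_s2 * beta22
            + dot(k1, psi21) * upper
            + psi22 * (sigma * norm_pdf(eta) + mean_s2 * upper))

def q1_quadrature(k1, beta21, beta22, psi21, psi22, gamma, sigma, m=20001):
    """E{V2 | S1, A1} by the trapezoidal rule, S2 ~ Normal(k1'gamma, sigma^2)."""
    mu = dot(k1, gamma)
    lo, hi = -10.0, 10.0
    step = (hi - lo) / (m - 1)
    total = 0.0
    for j in range(m):
        z = lo + j * step
        s2 = mu + sigma * z
        c2 = dot(k1, psi21) + s2 * psi22            # second-stage contrast
        v2 = dot(k1, beta21) + s2 * beta22 + max(c2, 0.0)
        weight = 0.5 if j in (0, m - 1) else 1.0
        total += weight * v2 * norm_pdf(z)
    return total * step

# Hypothetical parameter values.
params = dict(beta21=(1.0, 0.5, -0.3), beta22=0.8,
              psi21=(0.2, 0.0, -0.5), psi22=1.0,
              gamma=(0.0, 0.5, 0.0), sigma=1.0)
grid = [(1.0, s1, a1) for s1 in (-2.0, 0.0, 2.0) for a1 in (0.0, 1.0)]
closed = [q1_closed_form(k1, **params) for k1 in grid]
numeric = [q1_quadrature(k1, **params) for k1 in grid]
# Second difference in s1 at a1 = 0; it would be zero if Q1 were linear in s1.
curvature = closed[4] - 2.0 * closed[2] + closed[0]
```

A nonzero `curvature` on an equally spaced grid in $s_1$ is a direct sign that no model linear in $s_1$ can reproduce the implied first-stage Q-function.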

An alternative approach that may mitigate the risk of mismodeling is to employ flexible models for the 
Q-functions. Zhao, Kosorok, and Zeng (2009) use support vector regression models in place of the linear 
models described above. Indeed, recent developments in statistical learning suggest a large collection of 
powerful regression methods that might be used. Many of these methods must be tuned in order to balance 
bias and variance, a natural approach to which is to minimize the cross-validated mean squared error of 
the Q-functions at each decision point. An obvious downside is that the final model may be difficult to 
interpret, and clinicians may be unwilling to implement "black box" rules. One compromise is to fit a 
simple, interpretable model, such as a decision tree, to the fitted values of the complex model in order to 
get a feel for what factors are driving the recommended treatment decisions. One can then check the simple 
model against scientific theory. If the simple, approximate model appears sensible, then clinicians may be 
willing to use predictions from the more complex and less interpretable model. For further discussion and 
references, see Craven and Shavlik (1996). 






A-learning represents a middle ground between Q-learning and these approaches in that it allows for
flexible modeling of the functions $h_k(\bar{s}_k, \bar{a}_{k-1})$ while maintaining simple parametric models for the contrast
functions $C_k(\bar{s}_k, \bar{a}_{k-1})$. Thus, the resulting decision rule, which depends only on the contrast function,
remains interpretable, while the model for the response is allowed to be nonlinear. This is also appealing
in that it may be reasonable to expect, based on the underlying science, that the relationship between
patient history and outcome is complex while the optimal rule for treatment assignment is dependent, in
a simple fashion, on a small number of variables. The flexibility allowed by a semi-parametric model also
has its drawbacks. Techniques for formal model building, critique, and diagnosis are well understood for
linear models but much less so for semi-parametric models. Consequently, Q-learning based on building a
series of linear models may be more appealing to an analyst interested in formal diagnostics.

5 Simulation Studies 

We examine the finite sample performance of Q- and A-learning on a suite of test examples via Monte
Carlo simulation. To illustrate the trade-offs between the methods discussed in the preceding sections, we
begin with correctly specified models and then systematically introduce misspecification of the Q-function,
the propensity model, and both the Q-function and propensity model. In all cases, the contrast function
is correctly specified, as, when this is not the case, the form of the optimal regime induced by an incorrect
contrast function may not include $d^{\mathrm{opt}}$, making interpretation difficult. In all scenarios, 10,000 Monte Carlo
replications were used, and, for each generated data set, the estimated optimal regimes $\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$ in
(25) and (29) were obtained using the Q- and A-learning procedures described in Sections 4.1 and 4.2.

For simplicity, we consider one ($K = 1$) and two stage ($K = 2$) decision problems, where, at each
decision point, there are two feasible treatment options coded as 0 and 1. In all cases, we used Q-functions
of the form $Q_1(s_1, a_1; \xi_1) = h_1(s_1; \beta_1) + a_1 C_1(s_1; \psi_1)$ and $Q_2(\bar{s}_2, \bar{a}_2; \xi_2) = h_2(\bar{s}_2, a_1; \beta_2) + a_2 C_2(\bar{s}_2, a_1; \psi_2)$
to represent both true and assumed working models. With the contrast functions correctly specified, the
parameters $\psi_k$, $k = 1, 2$, dictate the optimal regime. Thus, as one measure of performance, we focus
on relative efficiency of the estimators of components of $\psi_k$ obtained by Q-learning to those obtained by
A-learning, as reflected by the ratio of their Monte Carlo mean squared errors (MSEs) (so by MSE of
A-learning/MSE of Q-learning), so that values greater than 1 favor Q-learning. Recognizing that $E\{Y^*(d^{\mathrm{opt}})\}$
is the benchmark achievable outcome on average, as a second measure, we consider the extent to which the
estimated regimes $\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$ achieve $E\{Y^*(d^{\mathrm{opt}})\}$ if followed by the population. Specifically, for regime
$d$ indexed by $\psi_1$ ($K = 1$) or $\psi_1$ and $\psi_2$ ($K = 2$), let $H(d) = E\{Y^*(d)\}$, a function of these parameters.
Then $H(d^{\mathrm{opt}}) = E\{Y^*(d^{\mathrm{opt}})\}$ is this function evaluated at the true parameter values, and $H(\hat{d}^{\mathrm{opt}})$ is this
function evaluated at the estimated parameter values for a given data set, where $\hat{d}^{\mathrm{opt}}$ represents $\hat{d}_Q^{\mathrm{opt}}$ or $\hat{d}_A^{\mathrm{opt}}$.
Define $R(\hat{d}^{\mathrm{opt}}) = E\{H(\hat{d}^{\mathrm{opt}})\}/H(d^{\mathrm{opt}})$, where the expectation in the numerator is with respect to the
distribution of the estimated parameters in $\hat{d}^{\mathrm{opt}}$, which may be interpreted as reflecting the efficiency with
which $\hat{d}^{\mathrm{opt}}$ achieves the performance of the true optimal regime. In Section A.6 of the Appendix, we discuss
calculation of $R(\hat{d}^{\mathrm{opt}})$.
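
The two performance measures can be computed from Monte Carlo output along the following lines. This is a minimal sketch; the function names and the tiny input lists are hypothetical stand-ins for the per-replication estimates and regime values, and the calculation of $H(\cdot)$ itself is not shown here.

```python
def mse(estimates, truth):
    """Monte Carlo mean squared error of a list of estimates."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def mse_ratio(est_a, est_q, truth):
    """MSE of A-learning / MSE of Q-learning; values > 1 favor Q-learning."""
    return mse(est_a, truth) / mse(est_q, truth)

def efficiency(values_of_estimated_regimes, value_of_optimal_regime):
    """R = (Monte Carlo mean of H(d_hat)) / H(d_opt)."""
    mean_value = sum(values_of_estimated_regimes) / len(values_of_estimated_regimes)
    return mean_value / value_of_optimal_regime

# Hypothetical estimates of one contrast parameter (truth 0.5), four replications.
est_q = [0.4, 0.6, 0.5, 0.5]
est_a = [0.3, 0.7, 0.5, 0.5]
ratio = mse_ratio(est_a, est_q, truth=0.5)

# Hypothetical values H(d_hat) from three replications, with H(d_opt) = 10.
r_hat = efficiency([9.5, 9.7, 9.6], 10.0)
```

Here `ratio` exceeds 1 because the hypothetical A-learning estimates are more variable, and `r_hat` is close to but below 1 because an estimated regime cannot outperform the true optimal regime on average.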

5.1 One Decision Point 

In this and the next section, $n = 200$. Here, the observed data are $(S_{1i}, A_{1i}, Y_i)$, $i = 1, \ldots, n$. With
$\mathrm{expit}(x) = e^x/(1 + e^x)$, to generate the data, we used

$$S_1 \sim \mathrm{Normal}(0, 1), \qquad A_1 \mid S_1 = s_1 \sim \mathrm{Bernoulli}\{\mathrm{expit}(\phi_{10}^0 + \phi_{11}^0 s_1 + \phi_{12}^0 s_1^2)\},$$
$$Y \mid S_1 = s_1, A_1 = a_1 \sim \mathrm{Normal}\{\beta_{10}^0 + \beta_{11}^0 s_1 + \beta_{12}^0 s_1^2 + a_1(\psi_{10}^0 + \psi_{11}^0 s_1), 9\},$$

so that the class of generative models is indexed by $\theta^0 = (\phi_{10}^0, \phi_{11}^0, \phi_{12}^0, \beta_{10}^0, \beta_{11}^0, \beta_{12}^0, \psi_{10}^0, \psi_{11}^0)^T$, and $d^{\mathrm{opt}} = d_1^{\mathrm{opt}}$,
$d_1^{\mathrm{opt}}(s_1) = I(\psi_{10}^0 + \psi_{11}^0 s_1 > 0)$. For A-learning, we assumed working models $h_1(s_1; \beta_1) = \beta_{10} + \beta_{11} s_1$,
$C_1(s_1; \psi_1) = \psi_{10} + \psi_{11} s_1$, and $\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_1)$, and for Q-learning used
$Q_1(s_1, a_1; \xi_1) = h_1(s_1; \beta_1) + a_1 C_1(s_1; \psi_1)$. Note that these working models involve correctly specified contrast functions and
are nested within the true generative models, with $h_1(s_1; \beta_1)$, and hence the Q-function, correctly specified
when $\beta_{12}^0 = 0$. Similarly, the propensity model $\pi_1(s_1; \phi_1)$ is correctly specified when $\phi_{12}^0 = 0$. To study the
effects of misspecification, we systematically varied these two parameters while keeping the others fixed,
considering parameter settings of the form $\theta^0 = (0, -2, \phi_{12}^0, 1, 1, \beta_{12}^0, 1, 0.5)^T$.
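
A data set from this class of one-decision generative models could be drawn as in the following sketch, which uses only the Python standard library. The function and argument names are hypothetical, and the outcome standard deviation `sd_y` is exposed as an argument; its default of 3 is an assumption about the outcome variance, which is hard to read in the source.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_one_stage(n, phi, beta, psi, sd_y=3.0, seed=0):
    """Draw n triples (S1, A1, Y).

    phi = (phi10, phi11, phi12) indexes the true propensity,
    beta = (beta10, beta11, beta12) the baseline mean, and
    psi = (psi10, psi11) the contrast function.
    sd_y is the outcome standard deviation (assumed default).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)
        p = expit(phi[0] + phi[1] * s + phi[2] * s * s)
        a = 1 if rng.random() < p else 0
        mean = beta[0] + beta[1] * s + beta[2] * s * s + a * (psi[0] + psi[1] * s)
        y = rng.gauss(mean, sd_y)
        data.append((s, a, y))
    return data

def optimal_rule(s, psi):
    """True optimal rule I(psi10 + psi11 * s > 0)."""
    return 1 if psi[0] + psi[1] * s > 0 else 0

# Correctly specified setting: phi12 = beta12 = 0.
data = simulate_one_stage(200, phi=(0.0, -2.0, 0.0), beta=(1.0, 1.0, 0.0),
                          psi=(1.0, 0.5), seed=0)
```

Setting the third element of `phi` or `beta` to a nonzero value reproduces the propensity or Q-function misspecification scenarios, respectively, when the linear working models are fitted to the result.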

Correctly specified models. As noted in Section 4.3, when all working models are correctly specified,
Q-learning is more efficient than A-learning. Under our class of generative models, this occurs when
$\beta_{12}^0 = \phi_{12}^0 = 0$. In this scenario, the relative efficiency of Q-learning with respect to A-learning is 1.06 for
estimating $\psi_{10}^0$ and 2.74 for estimating $\psi_{11}^0$. Thus, Q-learning is a modest 6% more efficient in estimating
$\psi_{10}^0$ but a dramatic 174% more efficient in estimating $\psi_{11}^0$. Interestingly, the efficiency of the decision rules
produced by Q- and A-learning is similar, with $R(\hat{d}_Q^{\mathrm{opt}}) = 0.97$ and $R(\hat{d}_A^{\mathrm{opt}}) = 0.95$, so that the relative
inefficiency in estimation of $\psi_1$ suffered by A-learning does not translate into a regime of poorer quality
than that found via Q-learning.

Misspecified propensity model. An appeal of A-learning is the double robustness property noted in
Section 4.2, which implies that $\psi_1$ should be estimated consistently when the propensity model is misspecified
provided that the Q-function is correct. Under our class of generative models, this corresponds to $\beta_{12}^0 = 0$
and nonzero $\phi_{12}^0$. In contrast, Q-learning does not depend on the propensity model, so its performance is
unaffected by this misspecification. Figure 1 shows the relative efficiency in estimating $\psi_{10}^0$ and $\psi_{11}^0$ and the
efficiency of $\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$ as $\phi_{12}^0$ varies from $-1$ to 1. The leftmost panel shows that there is minimal gain in
efficiency by using Q-learning instead of A-learning in estimation of $\psi_{10}^0$. From the center panel, Q-learning
yields substantial gains over A-learning for estimating $\psi_{11}^0$. Interestingly, the gain in efficiency of Q- over
A-learning is largest when $\phi_{12}^0 = 0$, which corresponds to the propensity model being correctly specified.
Letting $\pi_1^0(s_1; \phi_1^0)$ be the true propensity, $\phi_1^0 = (\phi_{10}^0, \phi_{11}^0, \phi_{12}^0)^T$, a possible explanation for this seemingly
contradictory result is that, as $|\phi_{12}^0|$ gets larger, $\mathrm{logit}\{\pi_1^0(s_1; \phi_1^0)\} = \phi_{10}^0 + \phi_{11}^0 s_1 + \phi_{12}^0 s_1^2$ becomes more
profoundly quadratic. Consequently, the estimator for $\phi_{11}$ in the posited model $\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_1)$
approaches zero, so that the posited propensity approaches a constant. Because Q- and A-learning are
equivalent under constant propensity, the efficiency gains decrease as $|\phi_{12}^0| \to \infty$. The right panel of
Figure 1 shows a small gain in efficiency of $\hat{d}_Q^{\mathrm{opt}}$ over $\hat{d}_A^{\mathrm{opt}}$, with both achieving good performance.

Figure 1: Monte Carlo MSE ratios for estimators of components of $\psi_1$ (left and center panels) and
efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (right panel) under misspecification of the propensity
model. MSE ratios > 1 favor Q-learning.

Misspecified Q-function. This scenario examines the second aspect of A-learning's double robustness
and is characterized in our class of true generative models by $\phi_{12}^0 = 0$ and nonzero $\beta_{12}^0$. Here, A-learning
leads to consistent estimation while Q-learning need not. The left panel of Figure 2 shows that the gain
in efficiency using A-learning is minimal in estimating $\psi_{10}^0$. The center panel illustrates the bias-variance
trade-off associated with choice between Q- and A-learning. For values of $\beta_{12}^0$ that are far from zero, the
bias in the misspecified Q-function dominates the variance, and A-learning enjoys smaller MSE while, for
small values of $\beta_{12}^0$, variance dominates bias, and Q-learning is more efficient. The right panel shows that
large bias in the Q-function can lead to meaningful loss (around 10%) in efficiency of $\hat{d}_Q^{\mathrm{opt}}$ relative to $\hat{d}_A^{\mathrm{opt}}$.

Both propensity model and Q-function misspecified. In our class of generative models, this corresponds
to nonzero values of both $\beta_{12}^0$ and $\phi_{12}^0$. Rather than vary both values (e.g., over a grid), we varied one and
chose the other so that it is "equivalently misspecified." In particular, for a given value of $\phi_{12}^0$, we selected
$\beta_{12}^0 = \beta_{12}^0(\phi_{12}^0)$ so that the $t$-statistic associated with testing $\phi_{12} = 0$ in the logistic propensity model and
the $t$-statistic associated with testing $\beta_{12} = 0$ in the linear Q-function would be approximately equal in
distribution. Consequently, across data sets, an analyst would be equally likely to detect either form of
misspecification. Details of this construction are given in Section A.7 of the Appendix.

As in the preceding scenario, Figure 3 illustrates the bias-variance trade-off associated with Q- and
A-learning. For large misspecification, A-learning provides a large enough reduction in bias to yield lower
MSE; for small misspecification, Q-learning incurs some bias but reduces the variance enough to yield
lower MSE. From the right panel of the figure, bias seems to translate into a larger loss in quality of the
estimators of $d^{\mathrm{opt}}$ than variance.

Figure 2: Monte Carlo MSE ratios for estimators of components of $\psi_1$ (left and center panels) and
efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (right panel) under misspecification of the Q-function.
MSE ratios > 1 favor Q-learning.

5.2 Two Decision Points 

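For $K = 2$, the observed data available to estimate $d^{\mathrm{opt}} = (d_1^{\mathrm{opt}}, d_2^{\mathrm{opt}})$ are $(S_{1i}, A_{1i}, S_{2i}, A_{2i}, Y_i)$, $i = 1, \ldots, n$.
For these scenarios, we used a class of true generative data models that differs from those of Chakraborty
et al. (2010), Song et al. (2010), and Laber et al. (2010) only in that $S_2$ is continuous instead of binary.
The generative model is

$$S_1 \sim \mathrm{Bernoulli}(0.5), \qquad A_1 \mid S_1 = s_1 \sim \mathrm{Bernoulli}\{\mathrm{expit}(\phi_{10}^0 + \phi_{11}^0 s_1)\},$$
$$S_2 \mid S_1 = s_1, A_1 = a_1 \sim \mathrm{Normal}(\delta_{10}^0 + \delta_{11}^0 s_1 + \delta_{12}^0 a_1 + \delta_{13}^0 s_1 a_1, 2),$$
$$A_2 \mid S_1 = s_1, S_2 = s_2, A_1 = a_1 \sim \mathrm{Bernoulli}\{\mathrm{expit}(\phi_{20}^0 + \phi_{21}^0 s_1 + \phi_{22}^0 a_1 + \phi_{23}^0 s_2 + \phi_{24}^0 a_1 s_2 + \phi_{25}^0 s_2^2)\},$$
$$Y \mid S_1 = s_1, S_2 = s_2, A_1 = a_1, A_2 = a_2 \sim \mathrm{Normal}\{m(s_1, s_2, a_1, a_2), 10\},$$
$$m(s_1, s_2, a_1, a_2) = \beta_{20}^0 + \beta_{21}^0 s_1 + \beta_{22}^0 a_1 + \beta_{23}^0 s_1 a_1 + \beta_{24}^0 s_2 + \beta_{25}^0 s_2^2 + a_2(\psi_{20}^0 + \psi_{21}^0 a_1 + \psi_{22}^0 s_2).$$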

The model is indexed by $\phi_1^0 = (\phi_{10}^0, \phi_{11}^0)^T$, $\delta^0 = (\delta_{10}^0, \delta_{11}^0, \delta_{12}^0, \delta_{13}^0)^T$, $\phi_2^0 = (\phi_{20}^0, \phi_{21}^0, \phi_{22}^0, \phi_{23}^0, \phi_{24}^0, \phi_{25}^0)^T$,
$\beta_2^0 = (\beta_{20}^0, \beta_{21}^0, \beta_{22}^0, \beta_{23}^0, \beta_{24}^0, \beta_{25}^0)^T$, and $\psi_2^0 = (\psi_{20}^0, \psi_{21}^0, \psi_{22}^0)^T$, with true
$h_2^0(s_1, s_2, a_1) = \beta_{20}^0 + \beta_{21}^0 s_1 + \beta_{22}^0 a_1 + \beta_{23}^0 s_1 a_1 + \beta_{24}^0 s_2 + \beta_{25}^0 s_2^2$ and contrast function
$C_2^0(s_1, s_2, a_1) = \psi_{20}^0 + \psi_{21}^0 a_1 + \psi_{22}^0 s_2$, say. Because
$A_1$ and $S_1$ are binary, the true functions $h_1^0(s_1) = \beta_{10}^0 + \beta_{11}^0 s_1$ and $C_1^0(s_1) = \psi_{10}^0 + \psi_{11}^0 s_1$ are linear in
$s_1$; $\beta_{10}^0$, $\beta_{11}^0$, $\psi_{10}^0$, and $\psi_{11}^0$ are derived in terms of parameters indexing the generative model in Section A.8
of the Appendix. Thus, the true optimal regime has $d_1^{\mathrm{opt}}(s_1) = I(\psi_{10}^0 + \psi_{11}^0 s_1 > 0)$ and
$d_2^{\mathrm{opt}}(s_1, s_2, a_1) = I(\psi_{20}^0 + \psi_{21}^0 a_1 + \psi_{22}^0 s_2 > 0)$.

Figure 3: Monte Carlo MSE ratios for estimators of components of $\psi_1$ (left and center panels) and
efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (right panel) under misspecification of both the
propensity model and the Q-function. MSE ratios > 1 favor Q-learning.

We assumed working models for A-learning of the form $h_1(s_1; \beta_1) = \beta_{10} + \beta_{11} s_1$, $C_1(s_1; \psi_1) = \psi_{10} + \psi_{11} s_1$,
$\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_1)$, $h_2(s_1, s_2, a_1; \beta_2) = \beta_{20} + \beta_{21} s_1 + \beta_{22} a_1 + \beta_{23} s_1 a_1 + \beta_{24} s_2$,
$C_2(s_1, s_2, a_1; \psi_2) = \psi_{20} + \psi_{21} a_1 + \psi_{22} s_2$, and $\pi_2(s_1, s_2, a_1; \phi_2) = \mathrm{expit}(\phi_{20} + \phi_{21} s_1 + \phi_{22} a_1 + \phi_{23} s_2 + \phi_{24} a_1 s_2)$; and, similarly,
assumed Q-functions of the form $Q_1(s_1, a_1; \xi_1) = h_1(s_1; \beta_1) + a_1 C_1(s_1; \psi_1)$ and
$Q_2(s_1, s_2, a_1, a_2; \xi_2) = h_2(s_1, s_2, a_1; \beta_2) + a_2 C_2(s_1, s_2, a_1; \psi_2)$ for Q-learning, so that the contrast functions are correctly specified
in each case. Comparison of the working and generative models shows that the former are correctly
specified when $\phi_{25}^0$ and $\beta_{25}^0$ are both zero and are misspecified otherwise. Thus, we systematically varied these
parameters to study the effects of misspecification, leaving all other parameter values fixed, taking
$\phi_1^0 = (0.3, -0.5)^T$, $\delta^0 = (0, 0.5, -0.75, 0.25)^T$, $\phi_2^0 = (0, 0.5, 0.1, -1, -0.1, \phi_{25}^0)^T$, $\beta_2^0 = (3, 0, 0.1, -0.5, -0.5, \beta_{25}^0)^T$,
and $\psi_2^0 = (1, 0.25, 0.5)^T$.
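
For readers who wish to reproduce the two-decision setting, the following sketch draws data from the generative model using the parameter values listed above with $\phi_{25}^0 = \beta_{25}^0 = 0$. It relies only on the standard library; the function and argument names are hypothetical, and the second arguments of the Normal distributions (2 and 10) are read here as variances, which is an assumption about the notation.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_two_stage(n, phi1, delta, phi2, beta2, psi2, seed=0):
    """Draw n tuples (S1, A1, S2, A2, Y) from the two-decision model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s1 = 1 if rng.random() < 0.5 else 0
        a1 = 1 if rng.random() < expit(phi1[0] + phi1[1] * s1) else 0
        mean_s2 = delta[0] + delta[1] * s1 + delta[2] * a1 + delta[3] * s1 * a1
        s2 = rng.gauss(mean_s2, math.sqrt(2.0))        # variance 2 (assumed reading)
        lin2 = (phi2[0] + phi2[1] * s1 + phi2[2] * a1 + phi2[3] * s2
                + phi2[4] * a1 * s2 + phi2[5] * s2 * s2)
        a2 = 1 if rng.random() < expit(lin2) else 0
        m = (beta2[0] + beta2[1] * s1 + beta2[2] * a1 + beta2[3] * s1 * a1
             + beta2[4] * s2 + beta2[5] * s2 * s2
             + a2 * (psi2[0] + psi2[1] * a1 + psi2[2] * s2))
        y = rng.gauss(m, math.sqrt(10.0))              # variance 10 (assumed reading)
        rows.append((s1, a1, s2, a2, y))
    return rows

rows = simulate_two_stage(
    200,
    phi1=(0.3, -0.5),
    delta=(0.0, 0.5, -0.75, 0.25),
    phi2=(0.0, 0.5, 0.1, -1.0, -0.1, 0.0),   # last entry is phi25
    beta2=(3.0, 0.0, 0.1, -0.5, -0.5, 0.0),  # last entry is beta25
    psi2=(1.0, 0.25, 0.5),
    seed=0,
)
```

Changing the last entry of `phi2` or `beta2` away from zero gives the misspecified propensity and misspecified Q-function scenarios, respectively.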

Correctly specified models. Given our working models, this occurs when $\phi_{25}^0 = \beta_{25}^0 = 0$ in the generative
models. As discussed previously, Q-learning is efficient when the models are correctly specified. Relative
efficiencies of Q-learning with respect to A-learning for estimating $\psi_{10}^0$, $\psi_{11}^0$, $\psi_{20}^0$, $\psi_{21}^0$, and $\psi_{22}^0$ are 1.07, 1.03,
1.19, 1.44, and 1.98, respectively. Hence, Q-learning is markedly more efficient in estimating the second
stage parameters but only modestly so in estimating first stage parameters. More efficient estimators of the
underlying parameters do not translate into significantly more efficient estimated regimes, as $R(\hat{d}_Q^{\mathrm{opt}}) = 0.96$
and $R(\hat{d}_A^{\mathrm{opt}}) = 0.96$.

Figure 4: Monte Carlo MSE ratios for estimators of components of $\psi_2$ and $\psi_1$ (upper row and lower row
left and center panels) and efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (lower right panel)
under misspecification of the propensity model. MSE ratios > 1 favor Q-learning.

Misspecified propensity model. The propensity model at the second stage is misspecified when $\phi_{25}^0$ is
nonzero. To isolate the effects of such misspecification, we set $\beta_{25}^0 = 0$ and varied $\phi_{25}^0$ between $-1$ and
1. From Figure 4, Q-learning is more efficient than A-learning for estimation of all parameters in $\psi_1$ and
$\psi_2$, and, as in the one decision case, the efficiency gain is largest when $\phi_{25}^0 = 0$, corresponding to a
correctly specified propensity model. From the lower right panel, there appears to be little difference in
efficiency of $\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$.

Figure 5: Monte Carlo MSE ratios for estimators of components of $\psi_2$ and $\psi_1$ (upper row and lower row
left and center panels) and efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (lower right panel)
under misspecification of the Q-functions. MSE ratios > 1 favor Q-learning.

Misspecified Q-function. Under our class of generative models, the Q-function is misspecified when $\beta_{25}^0$
is nonzero. We set $\phi_{25}^0 = 0$ to focus on the effects of such misspecification. Figure 5 shows that, for the
first stage parameters $\psi_{10}^0$ and $\psi_{11}^0$, there is little difference in efficiency between Q- and A-learning. The
upper panels illustrate varying degrees of the bias-variance trade-off between the methods. In particular, in
estimating $\psi_{22}^0$, a small amount of misspecification leads to significant bias, and hence A-learning produces
a much more accurate estimator, while, for $\psi_{20}^0$, the bias-variance trade-off is present but attenuated, and
there is little difference between Q- and A-learning. In estimation of $\psi_{21}^0$, variance appears to dominate
bias, and Q-learning is preferred for the chosen range of $\beta_{25}^0$ values. From the lower right panel, relative
efficiency for estimating $\psi_{22}^0$ weakly tracks the relative efficiencies of the estimated regimes $\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$,
suggesting that the efficiency gain for A-learning in estimating $\psi_{22}^0$ leads to improved estimation of $d^{\mathrm{opt}}$.

Figure 6: Monte Carlo MSE ratios for estimators of components of $\psi_2$ and $\psi_1$ (upper row and lower row
left and center panels) and efficiencies $R(\hat{d}_Q^{\mathrm{opt}})$ and $R(\hat{d}_A^{\mathrm{opt}})$ for estimating the true $d^{\mathrm{opt}}$ (lower right panel)
under misspecification of both the propensity models and Q-functions. MSE ratios > 1 favor Q-learning.

Both the propensity model and Q-function misspecified. Under our generative model, this scenario
corresponds to nonzero values of $\beta_{25}^0$ and $\phi_{25}^0$. Analogous to the one decision case, we chose pairs $(\phi_{25}^0, \beta_{25}^0)$
that are "equivalently misspecified;" see Section A.7 of the Appendix. Figure 6 shows the relative efficiency
of the two methods. There is no general trend in efficiency of estimation across parameters that might
recommend one method over the other. Furthermore, from the lower right panel, there is little difference in
efficiency of the estimated regimes. This is not surprising; one should not expect to draw broad conclusions,
as neither Q- nor A-learning need be consistent here. Interestingly, despite misspecification of both models,
$\hat{d}_Q^{\mathrm{opt}}$ and $\hat{d}_A^{\mathrm{opt}}$ still enjoy high efficiency.

5.3 Moodie and Richardson Scenario

The foregoing simulation scenarios deliberately involve simple models for the Q-functions in order to allow 
straightforward interpretation. To investigate the relative performance of the methods in a more challenging 
setting, we generated data from a scenario similar to that in Moodie et al. (2007) in which the true contrast 
functions are simple yet the Q-functions are complex. 

The data generating process used mimics a study in which HIV-infected patients are randomized to
receive antiretroviral therapy (coded as 1) or not (coded as 0) at baseline and again at six months, where
the randomization probabilities depend on baseline and six month CD4 counts. Specifically, we generated
baseline CD4 count $S_1 \sim \mathrm{Normal}(450, 150^2)$, and baseline treatment $A_1$ was then assigned according
to $A_1 \mid S_1 = s_1 \sim \mathrm{Bernoulli}\{\mathrm{expit}(\phi_{10}^0 + \phi_{11}^0 s_1)\}$. We generated six month CD4 count $S_2$, distributed
conditional on $S_1 = s_1, A_1 = a_1$ as $\mathrm{Normal}(1.25 s_1, 60^2)$. Treatment $A_2$ was then generated according to
$A_2 \mid S_1 = s_1, A_1 = a_1, S_2 = s_2 \sim \mathrm{Bernoulli}\{\mathrm{expit}(\phi_{20}^0 + \phi_{21}^0 s_2)\}$. In contrast to the scenario in Moodie et al.
(2007), this allows all possible treatment combinations. The outcome $Y$ is CD4 count at one year; following
Moodie et al. (2007), $Y$ was generated as $Y = Y^{\mathrm{opt}} - \mu_1^0(S_1, A_1) - \mu_2^0(S_1, S_2, A_1, A_2)$, where
$Y^{\mathrm{opt}} \mid S_1 = s_1, A_1 = a_1, S_2 = s_2, A_2 = a_2 \sim \mathrm{Normal}(400 + 1.6 s_1, 60^2)$. Here, $\mu_1^0(S_1, A_1)$ and $\mu_2^0(S_1, S_2, A_1, A_2)$ are the
true advantage (regret) functions; we took $C_1^0(s_1) = \psi_{10}^0 + \psi_{11}^0 s_1$ and $C_2^0(s_1, s_2, a_1) = \psi_{20}^0 + \psi_{21}^0 s_2$ to be
the true contrast functions, so that, from Section 4.2,

$$\mu_1^0(S_1, A_1) = (\psi_{10}^0 + \psi_{11}^0 S_1)\{I(\psi_{10}^0 + \psi_{11}^0 S_1 > 0) - A_1\}, \tag{31}$$
$$\mu_2^0(S_1, S_2, A_1, A_2) = (\psi_{20}^0 + \psi_{21}^0 S_2)\{I(\psi_{20}^0 + \psi_{21}^0 S_2 > 0) - A_2\}. \tag{32}$$

It follows that the optimal treatment regime $d^{\mathrm{opt}} = (d_1^{\mathrm{opt}}, d_2^{\mathrm{opt}})$ has $d_1^{\mathrm{opt}}(s_1) = I(\psi_{10}^0 + \psi_{11}^0 s_1 > 0)$ and
$d_2^{\mathrm{opt}}(s_1, s_2, a_1) = I(\psi_{20}^0 + \psi_{21}^0 s_2 > 0)$. While the true contrast functions are linear in $\psi_k^0$, $k = 1, 2$, the true
implied $h_1^0(s_1)$ and $h_2^0(s_1, a_1, s_2)$ are nonsmooth and possibly complex.

Following Moodie et al. (2007), for A-learning, we assumed working models $h_1(s_1; \beta_1) = \beta_{10} + \beta_{11} s_1$,
$C_1(s_1; \psi_1) = \psi_{10} + \psi_{11} s_1$, $h_2(s_1, s_2, a_1; \beta_2) = \beta_{20} + \beta_{21} s_1 + \beta_{22} a_1 + \beta_{23} s_1 a_1 + \beta_{24} s_2$, and
$C_2(s_1, s_2, a_1; \psi_2) = \psi_{20} + \psi_{21} s_2$, and assumed propensity models of the form $\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_1)$ and
$\pi_2(s_1, s_2, a_1; \phi_2) = \mathrm{expit}(\phi_{20} + \phi_{21} s_2)$. For Q-learning, we analogously assumed Q-functions
$Q_1(s_1, a_1; \xi_1) = h_1(s_1; \beta_1) + a_1 C_1(s_1; \psi_1)$
and $Q_2(s_1, s_2, a_1, a_2; \xi_2) = h_2(s_1, s_2, a_1; \beta_2) + a_2 C_2(s_1, s_2, a_1; \psi_2)$. Note that the contrast functions in each
case are correctly specified, as are the propensity models; however, the Q-functions are misspecified, as the
linear models $h_1(s_1; \beta_1)$ and $h_2(s_1, s_2, a_1; \beta_2)$ are poor approximations to the complex forms of the true
$h_1^0(s_1)$ and $h_2^0(s_1, s_2, a_1)$.

We report results for $n = 1000$ with $\phi_1^0 = (\phi_{10}^0, \phi_{11}^0)^T = (2.0, -0.006)^T$, $\phi_2^0 = (\phi_{20}^0, \phi_{21}^0)^T = (0.8, -0.004)^T$,
$\psi_1^0 = (\psi_{10}^0, \psi_{11}^0)^T = (250, -1.0)^T$, and $\psi_2^0 = (\psi_{20}^0, \psi_{21}^0)^T = (720, -2.0)^T$ in Table 1. Because the Q-functions
are misspecified, not unexpectedly, the Q-learning estimators for $\psi_1^0$ and $\psi_2^0$ are biased, while those
obtained via A-learning are consistent owing to the double robustness property. This leads to the dramatic
relative inefficiency of Q-learning reflected by the MSE ratios. Under the assumed models, the estimated
optimal regime for Q-learning dictates that, at baseline, antiretroviral therapy be given to patients with
baseline CD4 count less than 199.7, while that estimated using A-learning gives treatment to those with
baseline CD4 count less than 249.1, almost perfectly achieving the true optimal CD4 threshold of 250.
Under the data generative process, using the baseline decision rule estimated via Q-learning may result in
as many as 4.4% of patients who would receive therapy at baseline under the true optimal regime being
assigned no treatment. Similarly, at the second decision, the estimated optimal regimes obtained by Q-
and A-learning dictate that therapy be given to patients with six month CD4 count less than 320.2 and
360.1, respectively. Again, A-learning yields an estimated threshold almost identical to the optimal value
of 360. Although that obtained via Q-learning is lower, 4.3% of patients who should receive therapy at six
months would not if the estimated six month rule from Q-learning were followed by the population.

Using the approach outlined in Section A.6 of the Appendix, we have $H(d^{\mathrm{opt}}) = 1120$, whereas $E\{H(\hat d_Q^{\mathrm{opt}})\} \approx 1117.1$ (estimated standard error 1.3) and $E\{H(\hat d_A^{\mathrm{opt}})\} \approx 1119.9$ (0.3), so that $R(\hat d_Q^{\mathrm{opt}})$ and $R(\hat d_A^{\mathrm{opt}})$ are virtually equal to one. Thus, although Q-learning results in poor estimation of parameters in the contrast functions, efficiency loss for estimating the optimal regime is negligible. A possible explanation is that for the advantage (regret) functions in (31) and (32), patients near the true treatment decision boundary would have $C_k^0(\bar S_k, \bar A_{k-1})$, $k = 1, 2$, close to zero. Thus, even if a regime improperly assigns treatment, patients near this boundary have only a small loss in expected outcome. This, and the aforementioned fact that only a small subset of the population is affected by poor treatment decisions under Q-learning, results in the relatively good expected outcome under the estimated Q-learning regime.

Table 1: Monte Carlo average (standard deviation) of estimates obtained via Q- and A-learning and ratio of Monte Carlo MSE for the Moodie and Richardson scenario; MSE ratios > 1 favor Q-learning

Parameter (true value)      Q-learning         A-learning         MSE ratio
$\psi_{10}^0 = 250$         154.8 (23.2)       249.1 (18.7)       0.036
$\psi_{11}^0 = -1.0$        -0.775 (0.052)     -0.998 (0.041)     0.032
$\psi_{20}^0 = 720$         507.3 (49.2)       720.3 (48.4)       0.050
$\psi_{21}^0 = -2.0$        -1.584 (0.092)     -2.001 (0.085)     0.040
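The MSE ratios in Table 1 can be roughly reproduced from the reported Monte Carlo means and standard deviations. The sketch below assumes that the ratio is the MSE of A-learning divided by that of Q-learning (consistent with the caption, since ratios above one are said to favor Q-learning) and that MSE is approximated by squared bias plus variance; because the table entries are rounded, the recomputed values only approximately match the published ones. The dictionary layout and function names are illustrative, not from the paper.

```python
# Entries transcribed from Table 1: true value, then (mean, sd) per method.
TABLE_1 = {
    "psi_10": {"true": 250.0, "Q": (154.8, 23.2), "A": (249.1, 18.7)},
    "psi_11": {"true": -1.0, "Q": (-0.775, 0.052), "A": (-0.998, 0.041)},
    "psi_20": {"true": 720.0, "Q": (507.3, 49.2), "A": (720.3, 48.4)},
    "psi_21": {"true": -2.0, "Q": (-1.584, 0.092), "A": (-2.001, 0.085)},
}


def mse(true_value, mean, sd):
    """Approximate Monte Carlo MSE as squared bias plus variance."""
    return (mean - true_value) ** 2 + sd ** 2


def mse_ratio(row):
    """Assumed definition: MSE(A-learning) / MSE(Q-learning)."""
    return mse(row["true"], *row["A"]) / mse(row["true"], *row["Q"])


ratios = {name: mse_ratio(row) for name, row in TABLE_1.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

All four recomputed ratios are far below one, which is the sense in which the bias of the misspecified Q-learning estimators dominates the comparison.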

6 Application to STAR*D 

Sequenced Treatment Alternatives to Relieve Depression (STAR*D) was a prospective multisite, random- 
ized clinical trial enrolling 4041 patients designed to compare various treatment options for patients with 
major depressive disorder. The trial involved four levels, where each level consisted of a 12 week period of 
treatment, with scheduled clinic visits at approximate two week intervals (weeks 0, 2, 4, 6, 9, 12). Severity 
of depression at any visit was assessed using clinician-rated and self-reported versions of the Quick Inven- 
tory of Depressive Symptomatology (QIDS) score (Rush et al., 2003), for which higher values correspond 
to higher severity. At the end of each level, patients deemed to have an adequate clinical response to that 
level's treatment did not move on to future levels, where an adequate response was defined by 12-week 
clinician-rated QIDS score < 5 (remission) or showing a 50% or greater decrease from the baseline score at 
the beginning of level 1 (successful reduction). During level 1, all patients were treated with citalopram. 
Patients continuing to level 2 due to inadequate response were eligible to receive one of up to seven treat- 
ment options. We classify these options as either (i) switch: sertraline, bupropion, venlafaxine, or cognitive 
therapy, or (ii) augment: citalopram plus one of either bupropion, buspirone, or cognitive therapy. Patients 
assigned to cognitive therapy (alone or augmented with citalopram) were eligible, in the case of inadequate 
response, to move to a supplementary level 2A and switch to either bupropion or venlafaxine. All patients 
without adequate response at level 2 (or 2A, if applicable) continued to level 3. Level 3 treatments can 
again be classified as either (i) switch: mirtazapine or nortriptyline or (ii) augment with either lithium or triiodothyronine. Patients without adequate clinical response continued to level 4, requiring a switch to either tranylcypromine or mirtazapine combined with venlafaxine. For a complete description see Rush et al. (2004).

To demonstrate formulation of this problem within the framework of Sections 2 and 3, we take level 2A to be part of level 2 and consider only levels 2 and 3 of the study, calling them stages (decision points) 1 and 2, respectively ($K = 2$). Hence, we include in the analysis only the 1260 patients who entered level 2; 330 of these subsequently continued to level 3. Let $A_k$, $k = 1, 2$, be the treatment assigned at stage $k$ (beginning of level $k + 1$), taking values 0 (augment) or 1 (switch); both options are feasible for all eligible subjects. Let $S_{10}$ denote baseline QIDS score and $S_{11}$ denote the most recent QIDS score at level 1/beginning of level 2, respectively, so that $S_1 = (S_{10}, S_{11})^T$ is information available immediately prior to the first decision. Similarly, let $S_2$ be the information available immediately prior to decision 2; here, $S_2$ is the most recent QIDS score at the end of level 2/beginning of level 3. Finally, let $T$ be QIDS score at the end of level 3. Because some patients exhibited adequate response at the end of level 2 and did not progress to level 3, we define the outcome of interest to be $-S_2$ (negative QIDS score at the end of level 2) for patients not moving to level 3 and $-(S_2 + T)/2$ (average of negative QIDS scores at the end of levels 2 and 3) otherwise. Thus, writing $L = \max(5, S_{10}/2)$, we have $Y = -S_2 I(S_2 \le L) - (S_2 + T) I(S_2 > L)/2$, the cumulative average negative QIDS score. This demonstrates the case where outcome is a function of accrued information over the sequence of decisions.
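As a small illustration of the outcome definition, the following sketch computes the cumulative average negative QIDS score for one patient. The function name and argument names are hypothetical, and treating a level 2 score exactly equal to the threshold as an adequate response (no move to level 3) is an assumption of this sketch.

```python
def cumulative_outcome(s10, s2, t=None):
    """Cumulative average negative QIDS score for one patient.

    s10: baseline QIDS score; s2: score at end of level 2;
    t:   score at end of level 3 (only needed if the patient moved on).
    Assumption: s2 <= threshold counts as adequate response.
    """
    threshold = max(5.0, s10 / 2.0)  # remission or 50% reduction from baseline
    if s2 <= threshold:
        return -float(s2)
    if t is None:
        raise ValueError("end-of-level-3 score is required when s2 exceeds the threshold")
    return -(s2 + t) / 2.0


print(cumulative_outcome(20, 9))       # responder at level 2
print(cumulative_outcome(20, 14, 10))  # continued to level 3
```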

It is straightforward to deduce from (14) that, with $l_0$ denoting the realized value of $L$, $Q_2(\bar s_2, \bar a_2) = E(Y \mid \bar S_2 = \bar s_2, \bar A_2 = \bar a_2) = -s_2\{I(s_2 \le l_0) + I(s_2 > l_0)/2\} + E(-T \mid \bar S_2 = \bar s_2, \bar A_2 = \bar a_2, S_2 > l_0) I(s_2 > l_0)/2$, so that $V_2(\bar s_2, a_1) = -s_2 I(s_2 \le l_0) + \{-s_2 + U_2(\bar s_2, a_1)\} I(s_2 > l_0)/2$, where $U_2(\bar s_2, a_1) = \max_{a_2} E(-T \mid \bar S_2 = \bar s_2, A_1 = a_1, A_2 = a_2, S_2 > l_0)$. Thus, from (17),

$$Q_1(s_1, a_1) = E[-S_2 I(S_2 \le l_0) + \{-S_2 + U_2(\bar S_2, a_1)\} I(S_2 > l_0)/2 \mid S_1 = s_1, A_1 = a_1].$$

We describe implementation for Q-learning. At the second decision point, we must posit a model for $Q_2(\bar s_2, \bar a_2)$. From the form of $Q_2(\bar s_2, \bar a_2)$, we need only specify a model for $E(-T \mid \bar S_2 = \bar s_2, \bar A_2 = \bar a_2, S_2 > l_0)$; given the form of the conditioning set, this may be carried out using only the data from patients moving to level 3. Based on exploratory analysis, defining $s_{22}$ to be the slope of QIDS score over level 2 based on $s_{11}$ and $s_2$, we took this model to be of the form $\beta_{20} + \beta_{21} s_2 + \beta_{22} s_{22} + \psi_{20} a_2$, so that the posited Q-function is

$$Q_2(\bar s_2, \bar a_2; \xi_2) = -s_2\{I(s_2 \le l_0) + I(s_2 > l_0)/2\} + I(s_2 > l_0)(\beta_{20} + \beta_{21} s_2 + \beta_{22} s_{22} + \psi_{20} a_2)/2, \tag{33}$$

$\xi_2 = (\beta_{20}, \beta_{21}, \beta_{22}, \psi_{20})^T$. Under (33), $V_2(\bar s_2, a_1; \xi_2) = -s_2\{I(s_2 \le l_0) + I(s_2 > l_0)/2\} + I(s_2 > l_0)\{\beta_{20} + \beta_{21} s_2 + \beta_{22} s_{22} + \psi_{20} I(\psi_{20} > 0)\}/2$, and the "responses" $\tilde V_{2i}$ for use in (24) may then be formed by substituting the estimate for $\xi_2$. Based on exploratory analysis, we took the posited Q-function at the first stage to be $Q_1(s_1, a_1; \xi_1) = \beta_{10} + \beta_{11} s_{11} + \beta_{12} s_{12} + a_1(\psi_{10} + \psi_{11} s_{12})$, where $s_{12}$ is the slope of QIDS score over level 1 based on $s_{10}$ and $s_{11}$, and $\xi_1 = (\beta_{10}, \beta_{11}, \beta_{12}, \psi_{10}, \psi_{11})^T$. For A-learning, we posited models for the functions $h_k(\bar s_k, \bar a_{k-1})$ and $C_k(\bar s_k, \bar a_{k-1})$, $k = 1, 2$, in the obvious way analogous to the models above, and we took the propensity models to be of the form $\pi_2(\bar s_2, a_1; \phi_2) = \mathrm{expit}(\phi_{20} + \phi_{21} s_2 + \phi_{22} s_{22} + \phi_{23} a_1)$ and $\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_{11} + \phi_{12} s_{12})$.

The results are presented in Table 2. At the first stage, Q-learning suggests a treatment switch for those with level 1 QIDS slope greater than -0.97 (obtained by solving $1.12 + 1.15 s_{12} = 0$); A-learning assigns a treatment switch for those with QIDS slope during level 1 greater than -1.07. At the second stage (level 3), the results suggest that all patients should switch treatment and not augment their existing treatments.
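The backward-recursive Q-learning fit described in this section can be sketched as two least-squares regressions. The code below is a minimal illustration under stated assumptions, not the analysis code used for STAR*D: the function names are hypothetical, the data are synthetic with arbitrary coefficients, and ordinary least squares via NumPy stands in for whatever fitting routine was actually used.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for the regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def q_learning_two_stage(s11, s12, a1, s2, s22, a2, t, l0):
    """Q-learning with the linear working models of this section.

    Stage 2 is fitted only on subjects with s2 > l0 (those moving to level 3).
    Returns (xi1, xi2) with xi2 = (b20, b21, b22, psi20) and
    xi1 = (b10, b11, b12, psi10, psi11).
    """
    moved = s2 > l0
    # Stage 2: E(-T | ...) modeled as b20 + b21*s2 + b22*s22 + psi20*a2.
    X2 = np.column_stack([np.ones(moved.sum()), s2[moved], s22[moved], a2[moved]])
    xi2 = ols(X2, -t[moved])
    b20, b21, b22, psi20 = xi2
    # Pseudo-response: predicted value under the better stage-2 option.
    u2 = b20 + b21 * s2 + b22 * s22 + max(psi20, 0.0)
    v2 = np.where(moved, (-s2 + u2) / 2.0, -s2)
    # Stage 1: regress the pseudo-response on (1, s11, s12, a1, a1*s12).
    X1 = np.column_stack([np.ones_like(s11), s11, s12, a1, a1 * s12])
    xi1 = ols(X1, v2)
    return xi1, xi2


# Synthetic illustration (all generating values are arbitrary).
rng = np.random.default_rng(0)
n = 2000
s10 = rng.uniform(10, 25, n)
s11 = rng.uniform(8, 22, n)
s12 = rng.normal(-0.5, 1.0, n)
a1 = rng.integers(0, 2, n).astype(float)
s2 = rng.uniform(2, 20, n)
s22 = rng.normal(-0.5, 1.0, n)
a2 = rng.integers(0, 2, n).astype(float)
l0 = np.maximum(5.0, s10 / 2.0)
t = 1.0 + 0.8 * s2 - 1.0 * s22 - 1.5 * a2 + rng.normal(0, 1, n)

xi1, xi2 = q_learning_two_stage(s11, s12, a1, s2, s22, a2, t, l0)
# Estimated stage-1 rule: switch when psi10 + psi11 * s12 > 0.
print("stage 2 estimates:", np.round(xi2, 2))
print("stage 1 estimates:", np.round(xi1, 2))
```

With a fitted rule of the form "switch when $\psi_{10} + \psi_{11} s_{12} > 0$", the slope threshold reported in the text is obtained as $-\hat\psi_{10}/\hat\psi_{11}$.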

Table 2: STAR*D data analysis results. Asterisks indicate evidence at level of significance 0.05 that the parameter is non-zero

                  Q-learning                         A-learning
Parameter         Estimate   95% CI                  Estimate   95% CI
Stage 2
$\beta_{20}$      -1.46      (-3.47, 0.55)           -1.47      (-3.49, 0.54)
$\beta_{21}$      -0.75      (-0.88, -0.61) *        -0.75      (-0.88, -0.61) *
$\beta_{22}$       1.17      (0.52, 1.81)             1.17      (0.52, 1.81) *
$\psi_{20}$        1.10      (0.02, 2.19) *           1.12      (0.03, 2.22) *
Stage 1
$\beta_{10}$      -1.12      (-2.22, -0.03)          -0.90      (-2.03, 0.22)
$\beta_{11}$      -0.58      (-0.65, -0.51)          -0.59      (-0.66, -0.52) *
$\beta_{12}$       0.01      (-0.42, 0.45)            0.11      (-0.34, 0.57)
$\psi_{10}$        1.12      (0.43, 1.80) *           0.90      (0.17, 1.64) *
$\psi_{11}$        1.15      (0.20, 2.10) *           0.84      (-0.24, 1.92)

7 Discussion

We have provided a self-contained account of Q- and A-learning methods for estimating optimal dynamic treatment regimes, including a detailed discussion of the underlying statistical framework in which these methods may be formalized and of their relative merits. Our simulation studies confirm that, while A-learning may be inefficient relative to Q-learning in estimating parameters that define the optimal regime when the Q-functions required for the latter are correctly specified, A-learning may offer robustness to such misspecification. Nonetheless, Q-learning may have practical advantages in that it involves modeling tasks familiar to most data analysts, allowing the use of standard diagnostic tools. On the other hand, A-learning may be preferred in settings where it is expected that the form of the decision rules defining the optimal regime is not overly complex. However, A-learning increases in complexity with more than two treatment options at each stage, which may limit its appeal. Interestingly, our simulations demonstrate that inefficiency and bias in estimation of parameters defining the optimal regime do not necessarily translate into degradation of performance of the estimated regime for either method.

There remain many unresolved issues in estimation of optimal treatment regimes using these and other 
methods. Approaches to address the challenges of high-dimensional information and large numbers of 
decision points are required. Existing methods for model selection focusing on minimization of prediction
error may not be best for developing models optimal for decision-making. Formal inference procedures 
for evaluating the uncertainty associated with estimation of the optimal regime are challenging due to the 
nonsmooth nature of decision rules, which in turn leads to nonregularity of the parameter estimators; see 
Chakraborty et al. (2010), Laber et al. (2010), Song et al. (2010), and Laber and Murphy (2011). 

We have discussed sequential decision-making in the context of personalized medicine, but many other 
applications of these methods exist where, at one or more times in an evolving process, an action must be 
taken from among a set of plausible actions. Indeed, Q-learning was originally proposed in the computer 
science literature with these more general problems in mind; see Shortreed et al. (2010) for discussion.
Appendix

A.1 Demonstration That (5)-(8) Define an Optimal Regime

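For $k = 1, \ldots, K$ and any $d \in \mathcal{D}$, define the random variables $\alpha_k\{\bar S_k^*(\bar d_{k-1})\}$ such that

$$\alpha_k\{\bar S_k^*(\bar d_{k-1})\}(\omega) = V_k^{(1)}(\bar s_k, \bar u_{k-1}) \tag{A.1}$$

for any $\omega \in \Omega$, where $(\bar s_k, \bar u_{k-1})$ are defined by (3). We now argue that $d^{(1)\mathrm{opt}}$ is an optimal regime; i.e., $d^{(1)\mathrm{opt}}$ satisfies (4). We first show that, for any $d \in \mathcal{D}$,

$$E\{Y^*(d) \mid S_1 = s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\}$$
$$= \alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\}. \tag{A.2}$$

This follows because, for the set in $\Omega$ where $\{S_2^*(d_1) = s_2, \ldots, S_K^*(\bar d_{K-1}) = s_K\}$, the left- and right-hand sides of the first line of (A.2) are equal to

$$E\{Y^*(d) \mid \bar S_K^*(\bar d_{K-1}) = \bar s_K\} = E\{Y^*(\bar u_{K-1}, u_K) \mid \bar S_K^*(\bar u_{K-1}) = \bar s_K\}, \tag{A.3}$$
$$E\{Y^*(\bar d_{K-1}, d_K^{(1)\mathrm{opt}}) \mid \bar S_K^*(\bar d_{K-1}) = \bar s_K\} = E[Y^*\{\bar u_{K-1}, d_K^{(1)\mathrm{opt}}(\bar s_K, \bar u_{K-1})\} \mid \bar S_K^*(\bar u_{K-1}) = \bar s_K], \tag{A.4}$$

respectively. By the definition of $d_K^{(1)\mathrm{opt}}$ in (5), (A.4) is greater than or equal to (A.3), and, by the definition of $V_K^{(1)}$ in (6), (A.4) equals $V_K^{(1)}(\bar s_K, \bar u_{K-1})$. Because these results hold for sets $\{S_2^*(d_1) = s_2, \ldots, S_K^*(\bar d_{K-1}) = s_K\}$ for any $(s_2, \ldots, s_K)$, and by the definition of $\alpha_K$ in (A.1), (A.2) holds. Taking conditional expectations given $S_1 = s_1$ yields

$$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\} = E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \mid S_1 = s_1]. \tag{A.5}$$

The equality in (A.5) holds for any $\bar d_{K-1} = (d_1, \ldots, d_{K-1})$, hence it must hold for $(d_1, \ldots, d_{K-2}, d_{K-1}^{(1)\mathrm{opt}})$. Thus, we also have that

$$E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)\mathrm{opt}}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\} = E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2}), S_K^*(\bar d_{K-2}, d_{K-1}^{(1)\mathrm{opt}})\} \mid S_1 = s_1]. \tag{A.6}$$

Similarly, for any $k = K-1, \ldots, 1$, we can show that

$$E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_k)\} \mid S_1 = s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})]$$
$$\le E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_{k-1}, d_k^{(1)\mathrm{opt}})\} \mid S_1 = s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})] = \alpha_k\{s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})\},$$

which implies for $k = K-1, \ldots, 1$,

$$E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_k)\} \mid S_1 = s_1] \le E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_{k-1}, d_k^{(1)\mathrm{opt}})\} \mid S_1 = s_1]$$
$$= E[\alpha_k\{s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})\} \mid S_1 = s_1]. \tag{A.7}$$

Using (A.5) and (A.7) with $k = K-1$, we thus have

$$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\} = E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \mid S_1 = s_1]$$
$$\le E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-2}, d_{K-1}^{(1)\mathrm{opt}})\} \mid S_1 = s_1] \tag{A.8}$$
$$= E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1].$$

Because of (A.6), the term in (A.8) is equal to $E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)\mathrm{opt}}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\}$. Hence,

$$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)\mathrm{opt}}, d_K^{(1)\mathrm{opt}}) \mid S_1 = s_1\}$$
$$= E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1]. \tag{A.9}$$

Again, because $\bar d_{K-2}$ is arbitrary, if we replace it by $(\bar d_{K-3}, d_{K-2}^{(1)\mathrm{opt}})$, the equality in (A.9) implies

$$E\{Y^*(\bar d_{K-3}, \underline{d}_{K-2}^{(1)\mathrm{opt}}) \mid S_1 = s_1\} = E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-3}, d_{K-2}^{(1)\mathrm{opt}})\} \mid S_1 = s_1], \tag{A.10}$$

where, for any $d$, $\underline{d}_k = (d_k, \ldots, d_K)$. Using (A.7) with $k = K-2$, (A.9), and (A.10), we obtain

$$E\{Y^*(\bar d_{K-2}, \underline{d}_{K-1}^{(1)\mathrm{opt}}) \mid S_1 = s_1\} = E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1]$$
$$\le E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-3}, d_{K-2}^{(1)\mathrm{opt}})\} \mid S_1 = s_1] = E\{Y^*(\bar d_{K-3}, \underline{d}_{K-2}^{(1)\mathrm{opt}}) \mid S_1 = s_1\}$$
$$= E[\alpha_{K-2}\{s_1, S_2^*(d_1), \ldots, S_{K-2}^*(\bar d_{K-3})\} \mid S_1 = s_1].$$

Continuing in this fashion, we may conclude that, for any $d \in \mathcal{D}$,

$$E\{Y^*(d) \mid S_1 = s_1\} \le \cdots \le E\{Y^*(\bar d_{k-1}, \underline{d}_k^{(1)\mathrm{opt}}) \mid S_1 = s_1\} \le \cdots \le E\{Y^*(d^{(1)\mathrm{opt}}) \mid S_1 = s_1\},$$

showing that $d^{(1)\mathrm{opt}}$ defined in (5) and (7) is an optimal regime satisfying (4).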

A.2 Demonstration of Correspondence in (20)-(22) Under Assumptions in Section 2

We first consider the case $\ell = 1$. We make the positivity assumption that, for any $(\bar s_k, \bar a_{k-1})$ for which $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}) > 0$, $\mathrm{pr}(A_k = a_k \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}) > 0$ if and only if $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, $k = 1, \ldots, K$. This ensures that the observed data contain information on the treatments involved in the class of feasible regimes under consideration. We have $\Gamma_k^{(1)} = \Gamma_k$ by definition, so we need only demonstrate (21) and (22). We must show that, for any $(\bar s_k, \bar a_{k-1}) \in \Gamma_k$ and $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, $k = 1, \ldots, K$,

$$\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0, \tag{A.11}$$
$$\mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k) = \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}\} \tag{A.12}$$
$$= \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_j = \bar s_j, \bar A_{j-1} = \bar a_{j-1}, S_{j+1}^*(\bar a_j) = s_{j+1}, \ldots, S_k^*(\bar a_{k-1}) = s_k\}, \tag{A.13}$$

for $j = 1, \ldots, k$, where we define (A.13) with $j = k$ to be the same as the expression on the right-hand side of (A.12) and take $S_{K+1} = Y$ and $S_{K+1}^*(\bar a_K) = Y^*(\bar a_K)$.

Assume for the moment that (A.11) is true. We now demonstrate (A.12) and (A.13). For any fixed $k$, by the consistency assumption, the left-hand expression in (A.12) is equal to $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}, A_k = a_k\}$. It follows by the sequential randomization assumption, which implies $A_k \perp\!\!\!\perp S_{k+1}^*(\bar a_k) \mid \bar S_k, \bar A_{k-1}$, that this is equal to the right-hand side of (A.12). The equality in (A.13) follows by induction. Specifically, treating the right-hand side of (A.12) as (A.13) with $j = k$, the equality follows if we can show that (A.13) being true for a given $j$ implies that it is also true for $j - 1$. For a given $j = 2, \ldots, k$, by the consistency assumption, (A.13) is equal to $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_{j-1} = \bar s_{j-1}, \bar A_{j-2} = \bar a_{j-2}, A_{j-1} = a_{j-1}, S_j^*(\bar a_{j-1}) = s_j, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$. By the sequential randomization assumption, $A_{j-1} \perp\!\!\!\perp \{S_j^*(\bar a_{j-1}), \ldots, S_{k+1}^*(\bar a_k)\} \mid \bar S_{j-1}, \bar A_{j-2}$, so that this expression is equal to $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_{j-1} = \bar s_{j-1}, \bar A_{j-2} = \bar a_{j-2}, S_j^*(\bar a_{j-1}) = s_j, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$, which is (A.13) for $j - 1$. Note, then, that this implies that the conditional densities in (A.13), which are $j$-dependent, are the same as those on the left-hand side of (A.12), which are not.

We now prove (A.11) by induction. Assume we have shown that $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$. Then we must show that $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_{k+1} = \bar a_{k+1}) > 0$. If $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$, then

$$\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k) = \mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k)\, \mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k). \tag{A.14}$$

But we have shown above that if (A.11) is true; i.e., $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$, then (A.12) and (A.13) are equal for all $j$ and in particular $\mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k) = \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k^*(\bar a_{k-1}) = \bar s_k\}$. Because $(\bar s_{k+1}, \bar a_k) \in \Gamma_{k+1}$, then by condition (ii) of (2) defining $\Gamma_{k+1}$, $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k^*(\bar a_{k-1}) = \bar s_k\} > 0$, so that $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k) > 0$ because of (A.14). Now $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_{k+1} = \bar a_{k+1}) = \mathrm{pr}(A_{k+1} = a_{k+1} \mid \bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k)\, \mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k)$; however, because $a_{k+1} \in \Psi_{k+1}(\bar s_{k+1}, \bar a_k)$ and by the positivity assumption, $\mathrm{pr}(A_{k+1} = a_{k+1} \mid \bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k) > 0$ and hence $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_{k+1} = \bar a_{k+1}) > 0$. The proof is complete by noting that $\mathrm{pr}(S_1 = s_1, A_1 = a_1) = \mathrm{pr}(A_1 = a_1 \mid S_1 = s_1)\, \mathrm{pr}(S_1 = s_1)$, where $\mathrm{pr}(A_1 = a_1 \mid S_1 = s_1) > 0$ for $a_1 \in \Psi_1(s_1)$ by the positivity assumption.

To demonstrate (21) and (22) for $\ell = 1$, consider first the definitions of $d_K^{(1)\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K^{(1)}(\bar s_K, \bar a_{K-1})$ given in (5) and (6). These quantities involve the conditional expectation of the potential outcome $Y^*(\bar a_K)$ given $\bar S_K^*(\bar a_{K-1})$, which by (A.12)-(A.13) is the same as the conditional expectation of $Y$ given $\{\bar S_K = \bar s_K, \bar A_K = \bar a_K\}$. Thus, $d_K^{(1)\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K^{(1)}(\bar s_K, \bar a_{K-1})$ are the same as $d_K^{\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K(\bar s_K, \bar a_{K-1})$ defined in (15) and (16). Next, from (7) and (8),

$$d_{K-1}^{(1)\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2}) = \arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E[V_K^{(1)}\{\bar s_{K-1}, S_K^*(\bar a_{K-2}, a_{K-1}), \bar a_{K-2}, a_{K-1}\} \mid \bar S_{K-1}^*(\bar a_{K-2}) = \bar s_{K-1}].$$

This involves the conditional expectation of $V_K^{(1)}$, a function of $S_K^*(\bar a_{K-1})$, given $\bar S_{K-1}^*(\bar a_{K-2}) = \bar s_{K-1}$. Again, by (A.12)-(A.13), this is the same as the conditional expectation of the function $V_K^{(1)}$ of $S_K$ given $\{\bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = \bar a_{K-1}\}$. Because we have already shown that $V_K^{(1)}$ is the same as $V_K$, this implies that $d_{K-1}^{(1)\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2})$ is given by

$$\arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E\{V_K(\bar s_{K-1}, S_K, \bar a_{K-2}, a_{K-1}) \mid \bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = (\bar a_{K-2}, a_{K-1})\},$$

which is the same as $d_{K-1}^{\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2})$ given by (18) with $k = K - 1$. The argument continues in a backward iterative fashion for $k = K - 2, \ldots, 1$.

Now consider $\ell > 1$. The sets $\mathcal{V}_{\ell,k}$, $\ell = 1, \ldots, K$, $k = \ell, \ldots, K$, representing events of the form $\{\bar S_\ell^{(P)} = \bar s_\ell, \bar A_{\ell-1}^{(P)} = \bar a_{\ell-1}, S_{\ell+1}^*(\bar a_\ell) = s_{\ell+1}, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$, involved in the definitions of $\Gamma_k^{(\ell)}$ and (10)-(13), depend on the random variables $\bar S_k^{(P)}, \bar A_{k-1}^{(P)}$ for $k = \ell, \ldots, K$, which characterize how treatment assignment and covariate history arise in the population under routine practice. To demonstrate (20)-(22), in addition to those on the observed random variables given above, we also require sequential randomization and positivity assumptions on the "population" random variables; namely, that $A_k^{(P)} \perp\!\!\!\perp W^* \mid \bar S_k^{(P)}, \bar A_{k-1}^{(P)}$, $k = 1, \ldots, K$; and, for any $(\bar s_k, \bar a_{k-1})$ for which $\mathrm{pr}(\bar S_k^{(P)} = \bar s_k, \bar A_{k-1}^{(P)} = \bar a_{k-1}) > 0$, $\mathrm{pr}(A_k^{(P)} = a_k \mid \bar S_k^{(P)} = \bar s_k, \bar A_{k-1}^{(P)} = \bar a_{k-1}) > 0$ if and only if $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, $k = 1, \ldots, K$. If the observed data are from an observational study where $S_k^{(P)}, A_k^{(P)}$ are the same as $S_k, A_k$, these assumptions are equivalent to those on the observed data. For data from a SMART, however, more consideration is required. If the treatment options considered in the trial are restricted relative to those available in practice, then an estimated optimal regime based on the observed data may not be applicable to patients who present at the $\ell$th decision with treatment histories involving options not considered in the trial for $\ell > 1$. The positivity assumption here rules out such patients from consideration. The sequential randomization assumption holds for observed data by design for a SMART. However, whether or not it holds in the population, as we require here, depends on whether or not the covariate information collected in the trial contains the information used by patients and their providers to make treatment decisions in routine practice. If this is not the case, then the estimated optimal regime based on the trial data is still applicable to patients who present prior to the first decision, $\ell = 1$, but may not lead to optimal decision-making for patients presenting at subsequent decision points because the sequential randomization assumption at the population level may no longer hold.

Under these assumptions, it follows by an argument analogous to that above that (A.11)-(A.13) hold with the random variables $S_k, A_k$ replaced by $S_k^{(P)}, A_k^{(P)}$, $k = 1, \ldots, K$; namely

$$\mathrm{pr}(\bar S_k^{(P)} = \bar s_k, \bar A_k^{(P)} = \bar a_k) > 0, \tag{A.15}$$
$$\mathrm{pr}(S_{k+1}^{(P)} = s_{k+1} \mid \bar S_k^{(P)} = \bar s_k, \bar A_k^{(P)} = \bar a_k) = \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k^{(P)} = \bar s_k, \bar A_{k-1}^{(P)} = \bar a_{k-1}\} \tag{A.16}$$
$$= \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_j^{(P)} = \bar s_j, \bar A_{j-1}^{(P)} = \bar a_{j-1}, S_{j+1}^*(\bar a_j) = s_{j+1}, \ldots, S_k^*(\bar a_{k-1}) = s_k\}, \tag{A.17}$$

for $j = 1, \ldots, k$. We may then show that (20) holds as follows. Inspection of $\Gamma_k$ and $\Gamma_k^{(\ell)}$ shows both sets involve the same condition (i). Accordingly, we need only demonstrate that, if condition (ii) in $\Gamma_k^{(\ell)}$ holds, then so does (ii) in $\Gamma_k$, and vice versa. Condition (ii) in $\Gamma_k^{(\ell)}$ states that $\mathrm{pr}(\mathcal{V}_{\ell,k}) > 0$. Because the set $\mathcal{V}_{\ell,k} \subset \{\bar S_k^*(\bar a_{k-1}) = \bar s_k\}$, condition (ii) in $\Gamma_k$ follows immediately. In the converse direction, if (ii) of $\Gamma_k$ holds, then (A.15) holds. Because the set $\{\bar S_k^{(P)} = \bar s_k, \bar A_k^{(P)} = \bar a_k\} \subset \mathcal{V}_{\ell,k}$, $\mathrm{pr}(\mathcal{V}_{\ell,k}) > 0$, which is (ii) of $\Gamma_k^{(\ell)}$.

Now (21) and (22) follow by an argument similar to that for $\ell = 1$. First, we argue that, for any fixed $k = 1, \ldots, K$, the probabilities in (A.12) and (A.13) are the same as those in (A.16) and (A.17) for all $j = 1, \ldots, k$. This follows because (A.13) with $j = 1$ is equal to (A.17) with $j = 1$. We may now use this to show the result. Consider the definitions of $d_K^{(\ell)\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K^{(\ell)}(\bar s_K, \bar a_{K-1})$ given in (10) and (11). These quantities involve the conditional expectation of the potential outcome $Y^*(\bar a_K)$ given $\{\bar S_\ell^{(P)} = \bar s_\ell, \bar A_{\ell-1}^{(P)} = \bar a_{\ell-1}, S_{\ell+1}^*(\bar a_\ell) = s_{\ell+1}, \ldots, S_K^*(\bar a_{K-1}) = s_K\}$. But, because of the above equivalence of (A.12)-(A.13) and (A.16)-(A.17), this is the same as the conditional expectation of $Y$ given $\{\bar S_K = \bar s_K, \bar A_K = \bar a_K\}$. Thus, $d_K^{(\ell)\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K^{(\ell)}(\bar s_K, \bar a_{K-1})$ are the same as $d_K^{\mathrm{opt}}(\bar s_K, \bar a_{K-1})$ and $V_K(\bar s_K, \bar a_{K-1})$ defined in (15) and (16), and this is true for all $\ell = 2, \ldots, K$. Next, in accordance with (21) and (22),

$$d_{K-1}^{(\ell)\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2}) = \arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E[V_K^{(\ell)}\{\bar s_{K-1}, S_K^*(\bar a_{K-2}, a_{K-1}), \bar a_{K-2}, a_{K-1}\} \mid \bar S_\ell^{(P)} = \bar s_\ell, \bar A_{\ell-1}^{(P)} = \bar a_{\ell-1}, S_{\ell+1}^*(\bar a_\ell) = s_{\ell+1}, \ldots, S_{K-1}^*(\bar a_{K-2}) = s_{K-1}].$$

Note that this involves the conditional expectation of the function $V_K^{(\ell)}$ of $S_K^*(\bar a_{K-1})$ given $\bar S_\ell^{(P)} = \bar s_\ell, \bar A_{\ell-1}^{(P)} = \bar a_{\ell-1}, S_{\ell+1}^*(\bar a_\ell) = s_{\ell+1}, \ldots, S_{K-1}^*(\bar a_{K-2}) = s_{K-1}$. Again, this is the same as the conditional expectation of the function $V_K^{(\ell)}$ of $S_K$ given $\{\bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = \bar a_{K-1}\}$. Because we have shown that $V_K^{(\ell)}$ is independent of $\ell$ and equal to $V_K$, this implies that

$$d_{K-1}^{(\ell)\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2}) = \arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E\{V_K(\bar s_{K-1}, S_K, \bar a_{K-2}, a_{K-1}) \mid \bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = (\bar a_{K-2}, a_{K-1})\},$$

which is the same as $d_{K-1}^{\mathrm{opt}}(\bar s_{K-1}, \bar a_{K-2})$ given by (18) with $k = K - 1$. The argument continues in a backward iterative fashion for $k = K - 2, \ldots, 1$.

A.3 Justification for $\tilde V_{ki}$ in A-learning

We wish to show that

$$E\big(V_{k+1}(\bar S_{k+1}, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k] \,\big|\, \bar S_k, \bar A_{k-1}\big) = V_k(\bar S_k, \bar A_{k-1}). \tag{A.18}$$

Defining $T(\bar S_{k+1}, \bar A_k) = V_{k+1}(\bar S_{k+1}, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k]$, we may write the left-hand side of (A.18) as

$$E[E\{T(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} \mid \bar S_k, \bar A_{k-1}]. \tag{A.19}$$

The inner expectation in (A.19) may be seen to be equal to

$$E\{V_{k+1}(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} + C_k(\bar S_k, \bar A_{k-1})[I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k]$$
$$= Q_k(\bar S_k, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k].$$

Substituting $Q_k(\bar S_k, \bar A_k) = h_k(\bar S_k, \bar A_{k-1}) + A_k C_k(\bar S_k, \bar A_{k-1})$, $h_k(\bar S_k, \bar A_{k-1}) = Q_k(\bar S_k, \bar A_{k-1}, 0)$, we obtain $E\{T(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} = h_k(\bar S_k, \bar A_{k-1}) + C_k(\bar S_k, \bar A_{k-1}) I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} = V_k(\bar S_k, \bar A_{k-1})$. Substituting this in (A.19) yields the result.






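A.4 Demonstration of Equivalence of Q- and A-learning in a Special Case

We take $K = 1$ and let $\mathrm{pr}(A_1 = 1 \mid S_1 = s_1) = \pi$. Consider the A-learning estimating equations (28) with $k = 1$, and take $\lambda_1(s_1; \psi_1) = \partial/\partial\psi_1\, C_1(s_1; \psi_1)$. Then the equations become

$$\sum_{i=1}^n \frac{\partial C_1(S_{1i}; \psi_1)}{\partial \psi_1} (A_{1i} - \pi)\{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0,$$

$$\sum_{i=1}^n \frac{\partial h_1(S_{1i}; \beta_1)}{\partial \beta_1} \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0.$$

Likewise, under these conditions, taking $Q_1(s_1, a_1; \xi_1) = a_1 C_1(s_1; \psi_1) + h_1(s_1; \beta_1)$, the Q-learning equation is

$$\sum_{i=1}^n \frac{\partial Q_1(S_{1i}, A_{1i}; \xi_1)}{\partial \xi_1} \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0,$$

where, with $\xi_1 = (\psi_1^T, \beta_1^T)^T$,

$$\frac{\partial Q_1(S_{1i}, A_{1i}; \xi_1)}{\partial \xi_1} = \begin{pmatrix} A_{1i}\, \partial C_1(S_{1i}; \psi_1)/\partial \psi_1 \\ \partial h_1(S_{1i}; \beta_1)/\partial \beta_1 \end{pmatrix}.$$

Thus note that, with $C_1(s_1; \psi_1)$ and $h_1(s_1; \beta_1)$ linear in functions of $s_1$, as long as terms of the form in $C_1(s_1; \psi_1)$ are contained in those in $h_1(s_1; \beta_1)$, the Q- and A-learning estimating equations are identical, as then

$$\sum_{i=1}^n \frac{\partial C_1(S_{1i}; \psi_1)}{\partial \psi_1} \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0,$$

so that the term involving $\pi$ in the first A-learning equation vanishes. For example, if $C_1(s_1; \psi_1) = \psi_{10} + s_1^T \psi_{11}$ and $h_1(s_1; \beta_1) = \beta_{10} + s_1^T \beta_{11}$, then note that

$$\frac{\partial C_1(s_1; \psi_1)}{\partial \psi_1} = \frac{\partial h_1(s_1; \beta_1)}{\partial \beta_1} = \begin{pmatrix} 1 \\ s_1 \end{pmatrix},$$

and the result is immediate.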
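The equivalence can be checked numerically. The following sketch simulates one data set with a constant randomization probability and solves both sets of estimating equations, which are linear in the parameters when $C_1$ and $h_1$ are linear in $(1, s_1)$. The generating coefficients, sample size, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, pi = 500, 0.3                      # pi: known constant pr(A1 = 1 | S1)
s = rng.normal(size=n)
a = rng.binomial(1, pi, size=n).astype(float)
y = 1.0 + 0.5 * s + a * (0.4 - 0.8 * s) + rng.normal(size=n)

z = np.column_stack([np.ones(n), s])  # common basis (1, s1) for h1 and C1

# Q-learning: least squares for Y on (1, s1, a1, a1*s1);
# the parameter vector is ordered (beta10, beta11, psi10, psi11).
X = np.column_stack([z, a[:, None] * z])
theta_q = np.linalg.solve(X.T @ X, X.T @ y)

# A-learning: stack the two estimating equations
#   sum z_i r_i = 0  and  sum z_i (A_i - pi) r_i = 0,  r_i = Y_i - X_i theta,
# which read W'(y - X theta) = 0 with the "instrument" matrix W below.
W = np.column_stack([z, (a - pi)[:, None] * z])
theta_a = np.linalg.solve(W.T @ X, W.T @ y)

print(np.round(theta_q, 4))
print(np.round(theta_a, 4))
```

The two solutions agree up to floating-point error because the columns of `W` span the same space as those of `X`, which is the matrix form of the argument above.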
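A.5 Example of Incompatibility of Q-function Models

To show (30), noting $\mathcal{H}_2 = (1, s_1, a_1, s_2)^T = (\mathcal{K}_1^T, s_2)^T$, we have

$$E\{V_2(s_1, S_2, a_1; \xi_2) \mid S_1 = s_1, A_1 = a_1\} = \mathcal{K}_1^T \beta_{21} + \beta_{22} E(S_2 \mid S_1 = s_1, A_1 = a_1)$$
$$+ (\mathcal{K}_1^T \psi_{21}) E\{I(\mathcal{K}_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\}$$
$$+ \psi_{22} E\{S_2 I(\mathcal{K}_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\}.$$

Taking $\psi_{22} > 0$, we also have $I(\mathcal{K}_1^T \psi_{21} + S_2 \psi_{22} > 0) = I(S_2 > -\mathcal{K}_1^T \psi_{21}/\psi_{22})$, from which it follows that $E\{I(\mathcal{K}_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\} = 1 - \Phi\{(-\mathcal{K}_1^T \psi_{21}/\psi_{22} - \mathcal{K}_1^T \gamma)/\sigma\} = 1 - \Phi(\eta)$ for $\eta = -\mathcal{K}_1^T(\psi_{21}/\psi_{22} + \gamma)/\sigma$. Similarly, $E\{S_2 I(\mathcal{K}_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\} = E\{S_2 I(S_2 > -\mathcal{K}_1^T \psi_{21}/\psi_{22}) \mid S_1 = s_1, A_1 = a_1\}$. It is straightforward to deduce that this is equal to $\int_\eta^\infty (\sigma t + \mathcal{K}_1^T \gamma)\, \varphi(t)\, dt = \sigma \varphi(\eta) + (\mathcal{K}_1^T \gamma)\{1 - \Phi(\eta)\}$. Using $E(S_2 \mid S_1 = s_1, A_1 = a_1) = \mathcal{K}_1^T \gamma$ and combining yields (30).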
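The truncated-normal identity used in the last step, $\int_\eta^\infty (\sigma t + \mu)\varphi(t)\,dt = \sigma\varphi(\eta) + \mu\{1 - \Phi(\eta)\}$ with $\mu$ standing for $\mathcal{K}_1^T\gamma$, can be verified numerically. The sketch below compares the closed form with a simple trapezoid-rule quadrature; the numerical values of $\mu$, $\sigma$, and $\eta$ are arbitrary, and the upper integration limit is an assumed stand-in for infinity.

```python
import math
import numpy as np


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def closed_form(mu, sigma, eta):
    """sigma * phi(eta) + mu * {1 - Phi(eta)}."""
    return sigma * std_normal_pdf(eta) + mu * (1.0 - std_normal_cdf(eta))


def by_quadrature(mu, sigma, eta, upper=12.0, num=200_001):
    """Trapezoid-rule approximation of the integral from eta to `upper`."""
    t = np.linspace(eta, upper, num)
    integrand = (sigma * t + mu) * np.exp(-0.5 * t ** 2) / math.sqrt(2.0 * math.pi)
    dt = t[1] - t[0]
    return float(np.sum((integrand[:-1] + integrand[1:]) / 2.0) * dt)


mu, sigma, eta = 1.2, 0.7, -0.4
print(closed_form(mu, sigma, eta), by_quadrature(mu, sigma, eta))
```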

A.6 Calculation of $E\{H(\hat d^{\mathrm{opt}})\}$ and $R(\hat d^{\mathrm{opt}})$

Calculation for $K = 1$. We consider the generative data model in Section 5.1 and treatment regimes of the form $d(s_1) = d_1(s_1) = I(\psi_{10} + \psi_{11} s_1 > 0)$ for arbitrary $\psi_{10}, \psi_{11}$. It is possible to derive analytically $H(d) = E\{Y^*(d)\}$ in this case. Under the generative data model, $E\{Y^*(d)\} = E[E\{Y^*(d) \mid S_1\}] = E[E\{Y \mid S_1, A_1 = d_1(S_1)\}] = \beta_{10}^0 + \beta_{11}^0 E(S_1) + \beta_{12}^0 E(S_1^2) + E\{I(\psi_{10} + \psi_{11} S_1 > 0)(\psi_{10}^0 + \psi_{11}^0 S_1)\}$, and $S_1 \sim \mathrm{Normal}(0, 1)$. It is straightforward to deduce that $E\{I(\psi_{10} + \psi_{11} S_1 > 0)\} = \mathrm{pr}(S_1 > -\psi_{10}/\psi_{11})$ or $\mathrm{pr}(S_1 < -\psi_{10}/\psi_{11})$ as $\psi_{11} > 0$ or $< 0$, which is readily obtained from the standard normal cdf. Likewise, $E\{S_1 I(\psi_{10} + \psi_{11} S_1 > 0)\} = E(S_1 \mid S_1 > -\psi_{10}/\psi_{11})\, \mathrm{pr}(S_1 > -\psi_{10}/\psi_{11})$ if $\psi_{11} > 0$ and $E\{S_1 I(\psi_{10} + \psi_{11} S_1 > 0)\} = E(S_1 \mid S_1 < -\psi_{10}/\psi_{11})\, \mathrm{pr}(S_1 < -\psi_{10}/\psi_{11})$ if $\psi_{11} < 0$, which are again easily calculated in a manner similar to that in Section A.5. Thus, $E\{Y^*(d^{\mathrm{opt}})\}$ is obtained by substituting $\psi_{10}^0, \psi_{11}^0$ in the resulting expression. To approximate $E\{H(\hat d^{\mathrm{opt}})\}$ and hence $R(\hat d^{\mathrm{opt}})$ for $\hat d^{\mathrm{opt}} = \hat d_Q^{\mathrm{opt}}$ or $\hat d_A^{\mathrm{opt}}$, we may use Monte Carlo simulation. Specifically, for the $b$th of $B$ Monte Carlo data sets, substitute the estimates $\hat\psi_{10,b}, \hat\psi_{11,b}$, say, defining $\hat d^{\mathrm{opt}}$ for that data set in the expression for $E\{Y^*(d)\}$, and call the resulting quantity $U_b$. Then $E\{H(\hat d^{\mathrm{opt}})\}$ is approximated by $B^{-1} \sum_{b=1}^B U_b$. Combining yields the approximation to $R(\hat d^{\mathrm{opt}})$.
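The analytic calculation for $K = 1$ can be sketched as follows. The function name and the numerical parameter values are hypothetical; the code assumes, as stated above, that $S_1$ is standard normal and that the conditional mean of $Y$ is $\beta_{10}^0 + \beta_{11}^0 s_1 + \beta_{12}^0 s_1^2 + a_1(\psi_{10}^0 + \psi_{11}^0 s_1)$. It also assumes that $R$ is the ratio of $E\{H(\hat d^{\mathrm{opt}})\}$ to $H(d^{\mathrm{opt}})$, which is how the values reported in Section 5 read.

```python
import math
import numpy as np


def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def value_linear_rule(psi10, psi11, beta0, psi0):
    """H(d) for the rule d1(s1) = I(psi10 + psi11*s1 > 0), S1 ~ Normal(0, 1).

    beta0 = (b10, b11, b12) and psi0 = (p10, p11) are the TRUE parameters.
    """
    b10, b11, b12 = beta0
    p10, p11 = psi0
    base = b10 + b11 * 0.0 + b12 * 1.0           # E(S1) = 0, E(S1^2) = 1
    if psi11 == 0.0:                             # rule ignores s1
        prob, s_part = (1.0 if psi10 > 0 else 0.0), 0.0
    else:
        c = -psi10 / psi11
        if psi11 > 0:                            # treat when S1 > c
            prob, s_part = 1.0 - cdf(c), pdf(c)
        else:                                    # treat when S1 < c
            prob, s_part = cdf(c), -pdf(c)
    return base + p10 * prob + p11 * s_part


# Hypothetical true parameters and a few hypothetical estimated rules.
beta0, psi0 = (1.0, 0.5, 0.2), (0.4, -0.8)
h_opt = value_linear_rule(psi0[0], psi0[1], beta0, psi0)
estimated_rules = [(0.35, -0.7), (0.5, -0.9), (0.3, -0.85)]
u = [value_linear_rule(p0, p1, beta0, psi0) for p0, p1 in estimated_rules]
r_hat = float(np.mean(u)) / h_opt                # assumed definition of R
print(h_opt, r_hat)
```

No rule of this form can have value exceeding that of the rule built from the true $\psi^0$, so `r_hat` is at most one here.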

Calculation for $K = 2$. The developments are analogous to those above. We consider the generative data model in Section 5.2 and treatment regimes of the form $d = (d_1, d_2)$, where $d_1(s_1) = I(\psi_{10} + \psi_{11} s_1 > 0)$ and $d_2(s_1, s_2, a_1) = I(\psi_{20} + \psi_{21} a_1 + \psi_{22} s_2 > 0)$ for arbitrary $\psi_{10}, \psi_{11}, \psi_{20}, \psi_{21}, \psi_{22}$. Here,

$$E\{Y^*(d)\} = E(E[E\{Y^*(d) \mid S_2^*(d_1), S_1\} \mid S_1]) = E\big\{E\big(E[Y \mid S_2, S_1, A_1 = d_1(S_1), A_2 = d_2\{S_2, S_1, d_1(S_1)\}] \,\big|\, S_1, A_1 = d_1(S_1)\big)\big\}.$$

Because $S_1$ is binary taking values in $\{0, 1\}$,

$$E\{Y^*(d)\} = E\big(E[Y \mid S_2, S_1, A_1 = d_1(0), A_2 = d_2\{S_2, 0, d_1(0)\}] \,\big|\, S_1 = 0, A_1 = d_1(0)\big)\, \mathrm{pr}(S_1 = 0)$$
$$+ E\big(E[Y \mid S_2, S_1, A_1 = d_1(1), A_2 = d_2\{S_2, 1, d_1(1)\}] \,\big|\, S_1 = 1, A_1 = d_1(1)\big)\, \mathrm{pr}(S_1 = 1).$$

Under the generative model, writing $a_1 = I(\psi_{10} + \psi_{11} s_1 > 0)$ for brevity, these expectations are of the form

$$E\big(E[Y \mid S_2, S_1, A_1 = d_1(s_1), A_2 = d_2\{S_2, s_1, d_1(s_1)\}] \,\big|\, S_1 = s_1, A_1 = d_1(s_1)\big) = \beta_{20}^0 + \beta_{21}^0 s_1 + \beta_{22}^0 a_1 + \beta_{23}^0 s_1 a_1$$
$$+ \beta_{24}^0 E\{S_2 \mid S_1 = s_1, A_1 = d_1(s_1)\} + \beta_{25}^0 E\{S_2^2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$$
$$+ (\psi_{20}^0 + \psi_{21}^0 a_1) E\{I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$$
$$+ \psi_{22}^0 E\{S_2 I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\},$$

for $s_1 = 0, 1$. In the generative data model, the conditional distribution of $S_2$ given $S_1, A_1$ is normal; accordingly, it is straightforward to calculate $E\{S_2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$, $E\{S_2^2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$, $E\{I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$, and $E\{S_2 I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$ in a manner analogous to those for the case $K = 1$. Approximation of $E\{H(\hat d^{\mathrm{opt}})\}$ and hence $R(\hat d^{\mathrm{opt}})$ for $\hat d^{\mathrm{opt}} = \hat d_Q^{\mathrm{opt}}$ or $\hat d_A^{\mathrm{opt}}$ may then be carried out as for the case $K = 1$.

Calculation by simulation. When an analytical expression for $H(d) = E\{Y^*(d)\}$ for regimes of a certain form $d$ is not available, $H(d)$ for a fixed $d$ may be approximated by simulation using the g-computation algorithm of Robins (1986). We demonstrate for $K = 2$, so that $d = (d_1, d_2)$; the procedure for $K = 1$ is then immediate. For total number of simulations $B$, for each $b = 1, \ldots, B$, the steps are: (i) generate $s_{1b}$ from the true distribution of $S_1$; (ii) generate $s_{2b}$ from the true conditional distribution of $S_2$ given $S_1 = s_{1b}$ and $A_1 = d_1(s_{1b})$; (iii) evaluate the true $E(Y \mid \bar S_2 = \bar s_2, \bar A_2 = \bar a_2)$ at $\bar s_2 = \bar s_{2b} = (s_{1b}, s_{2b})$ and $\bar a_2 = [d_1(s_{1b}), d_2\{\bar s_{2b}, d_1(s_{1b})\}]$, and call the resulting value $U_b$; and (iv) estimate $H(d) = E\{Y^*(d)\}$ by $B^{-1} \sum_{b=1}^B U_b$. When $d = \hat d_Q^{\mathrm{opt}}$ or $\hat d_A^{\mathrm{opt}}$, one would follow the above procedure for each Monte Carlo data set. In each of steps (i)-(iii), it is important to recognize that, while $\hat d_Q^{\mathrm{opt}}$ and $\hat d_A^{\mathrm{opt}}$ are determined by the estimated $\psi$, the distributions from which realizations are generated depend on the true $\beta$ and $\psi$. The values of $E\{H(\hat d_Q^{\mathrm{opt}})\}$ and $E\{H(\hat d_A^{\mathrm{opt}})\}$ may then be approximated by the average of the estimated $H(\hat d_Q^{\mathrm{opt}})$ and $H(\hat d_A^{\mathrm{opt}})$ across the Monte Carlo data sets, as before.
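Steps (i)-(iv) above can be sketched as a short vectorized routine. The function `g_computation_value` and the toy generative model passed to it are hypothetical and chosen only for illustration; they are not the models of Section 5.2.

```python
import numpy as np


def g_computation_value(d1, d2, draw_s1, draw_s2, mean_y, B=100_000, seed=0):
    """Approximate H(d) = E{Y*(d)} for a fixed regime d = (d1, d2)."""
    rng = np.random.default_rng(seed)
    s1 = draw_s1(rng, B)              # (i) draw from the true law of S1
    a1 = d1(s1)                       # treatment dictated by rule 1
    s2 = draw_s2(rng, s1, a1)         # (ii) draw S2 given S1 and A1 = d1(S1)
    a2 = d2(s1, s2, a1)               # treatment dictated by rule 2
    u = mean_y(s1, s2, a1, a2)        # (iii) true conditional mean of Y
    return float(np.mean(u))          # (iv) average over the B draws


# Toy generative model (assumed for illustration only).
draw_s1 = lambda rng, B: rng.binomial(1, 0.5, size=B).astype(float)
draw_s2 = lambda rng, s1, a1: rng.normal(0.5 * s1 + 0.3 * a1, 1.0)
mean_y = lambda s1, s2, a1, a2: 1.0 + s2 + a2 * (0.5 - s2)

never_1 = lambda s1: np.zeros_like(s1)
never_2 = lambda s1, s2, a1: np.zeros_like(s2)
tailored_2 = lambda s1, s2, a1: (0.5 - s2 > 0).astype(float)

h_never = g_computation_value(never_1, never_2, draw_s1, draw_s2, mean_y)
h_tailored = g_computation_value(never_1, tailored_2, draw_s1, draw_s2, mean_y)
print(h_never, h_tailored)
```

When the regime is an estimated one, the rules `d1` and `d2` would be built from the estimated $\psi$ while `draw_s1`, `draw_s2`, and `mean_y` continue to use the true generative parameters, as emphasized in the text.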

A.7 Creating "Equivalently Misspecified Pairs" When Both the Propensity
Model and Q-function are Misspecified 

Consider the $K = 2$ decision point scenario; the developments apply equally to the $K = 1$ setting. To
identify pairs $(\beta^0_{25}, \phi^0_{25})$ that are "equivalently misspecified," for each of the combinations of
$\beta^0_{25}$ and $\phi^0_{25}$ within a pre-specified grid, say $(\beta^0_{25}, \phi^0_{25}) \in [-1, 1] \times [-1, 1]$ with a
step size of 0.05, we generate a large data set of size $n = 10{,}000$ from the generative data model in
Section 5.2 with all other parameters fixed at their true values. This yields $41 \times 41 = 1681$ combinations
and hence 1681 such data sets. For each data set, the linear regression model for the response and the
logistic model for propensity of treatment assignment are then fitted, and the ratio of standard errors for
$\phi_{25}$ and $\beta_{25}$, $SE(\hat{\phi}_{25}) / SE(\hat{\beta}_{25})$, say, obtained. We then fit to these values a polynomial
model in $\phi^0_{25}$, $f(\phi^0_{25})$, say, and select the polynomial degree yielding a sufficiently large adjusted
$R^2$. Setting $\beta^0_{25} = \phi^0_{25} / f(\phi^0_{25})$ then yields the result that the corresponding
t-statistics will be approximately equal. These were re-checked in the course of running the simulations so 
that the t-statistics differed by less than some reasonable value, usually at most a 5 percent difference, as 
it cannot be guaranteed that they will be precisely the same. 
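
The polynomial-fitting step can be sketched as below. This is a hedged illustration, not the authors' implementation: the helper names (`polyfit`, `polyval`, `adjusted_r2`, `select_degree`, `matched_beta`), the adjusted $R^2$ threshold of 0.99, and the maximum degree of 5 are all assumptions, since the text says only that the adjusted $R^2$ should be "sufficiently large." The inputs `x` and `y` stand for the grid values of $\phi_{25}$ and the corresponding fitted standard error ratios.

```python
def polyfit(x, y, degree):
    """Least-squares polynomial coefficients (constant term first)."""
    k = degree + 1
    # Augmented normal equations [X'X | X'y].
    m = [[sum(xi ** (i + j) for xi in x) for j in range(k)]
         + [sum(yi * xi ** i for xi, yi in zip(x, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def polyval(coef, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coef))

def adjusted_r2(x, y, coef):
    """Adjusted R^2 of the polynomial fit."""
    n, p = len(y), len(coef) - 1
    ybar = sum(y) / n
    ss_res = sum((yi - polyval(coef, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

def select_degree(x, y, threshold=0.99, max_degree=5):
    """Return the lowest degree (and its fit) whose adjusted R^2 meets the threshold."""
    for d in range(1, max_degree + 1):
        coef = polyfit(x, y, d)
        if adjusted_r2(x, y, coef) >= threshold:
            return d, coef
    return max_degree, coef  # fall back to the largest degree tried

def matched_beta(phi, coef):
    """Value of beta_25 paired with phi_25, i.e. phi / f(phi)."""
    return phi / polyval(coef, phi)
```

With `coef` in hand, `matched_beta(phi, coef)` gives the value to pair with a chosen `phi`; as the text notes, the resulting t-statistics should still be checked empirically.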

A.8 Derivation of $h_1^0(s_1; \beta_1)$ and $C_1^0(s_1; \psi_1)$ in the Two Decision Point Scenario

We seek to identify the true $h_1^0(s_1)$ and $C_1^0(s_1)$, where $S_1$ and $A_1$ are Bernoulli. With
$h_1^0(s_1) = \beta^0_{10} + \beta^0_{11} s_1$ and $C_1^0(s_1) = \psi^0_{10} + \psi^0_{11} s_1$, it follows that the true
Q-function at the first decision is $Q_1^0(s_1, a_1) = h_1^0(s_1) + a_1 C_1^0(s_1)$. We thus calculate
$Q_1^0(s_1, a_1)$ under the generative model and equate terms to determine the form of $\beta^0_{10}$, $\beta^0_{11}$,
$\psi^0_{10}$, and $\psi^0_{11}$. The true value function at the second decision is
$V_2^0(S_1, S_2, A_1) = h_2^0(S_1, S_2, A_1) + C_2^0(S_1, S_2, A_1) I\{C_2^0(S_1, S_2, A_1) > 0\}$. Thus,
$Q_1^0(s_1, a_1) = E\{V_2^0(S_1, S_2, A_1) \mid S_1 = s_1, A_1 = a_1\} = \beta^0_{20} + \beta^0_{21} s_1 + \beta^0_{22} a_1 + \beta^0_{23} s_1 a_1 + \beta^0_{24} E\{S_2 \mid S_1 = s_1, A_1 = a_1\} + \beta^0_{25} E\{S_2^2 \mid S_1 = s_1, A_1 = a_1\}$
$+ E[C_2^0(S_1, S_2, A_1) I\{C_2^0(S_1, S_2, A_1) > 0\} \mid S_1 = s_1, A_1 = a_1]$. The conditional expectations in
this expression may be calculated in a manner analogous to that in Section A.5 to obtain the form of
$Q_1^0(s_1, a_1)$. It follows that $Q_1^0(0, 0) = \beta^0_{10}$, $Q_1^0(1, 0) = \beta^0_{10} + \beta^0_{11}$,
$Q_1^0(0, 1) = \beta^0_{10} + \psi^0_{10}$, and $Q_1^0(1, 1) = \beta^0_{10} + \beta^0_{11} + \psi^0_{10} + \psi^0_{11}$, which
may be solved to yield expressions for $\beta^0_{10}$, $\beta^0_{11}$, $\psi^0_{10}$, and $\psi^0_{11}$.
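
For concreteness, solving the four equations above by successive differencing gives the following expressions. This is a direct algebraic consequence of the stated system and does not depend on the particular form of $Q_1^0$:

```latex
\begin{align*}
\beta^0_{10} &= Q_1^0(0,0), \\
\beta^0_{11} &= Q_1^0(1,0) - Q_1^0(0,0), \\
\psi^0_{10} &= Q_1^0(0,1) - Q_1^0(0,0), \\
\psi^0_{11} &= Q_1^0(1,1) - Q_1^0(1,0) - Q_1^0(0,1) + Q_1^0(0,0).
\end{align*}
```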